Compare commits
101 commits
runtime-su
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 58ac6edd7d | | |
| | 19fd8799d9 | | |
| | 7f17b65278 | | |
| | e6a2443412 | | |
| | f9b145585f | | |
| | 3b620ef7e3 | | |
| | 745e52723c | | |
| | 1abe925f65 | | |
| | 1c69a5bc29 | | |
| | 02e7c28823 | | |
| | db592fbc28 | | |
| | 00fc36df3a | | |
| | f5dcefc752 | | |
| | 98437d46b2 | | |
| | 5e97b4e448 | | |
| | ffb0608b9a | | |
| | f381023206 | | |
| | cb4ae756ab | | |
| | cfe5e02372 | | |
| | 039f9f7247 | | |
| | 495741e7ac | | |
| | 43c5d45353 | | |
| | f64cec645e | | |
| | 1db9db7d03 | | |
| | 52607a7cdd | | |
| | b9ed118b8c | | |
| | bf1415e4c1 | | |
| | 31b48d162a | | |
| | 3499b2f280 | | |
| | f41ec5d0c5 | | |
| | 20f6761a67 | | |
| | 07bd498fd6 | | |
| | 90c8e77bf7 | | |
| | ab8895d28b | | |
| | bd7f955e4e | | |
| | 99200e6690 | | |
| | dcacac6965 | | |
| | e52b2e2259 | | |
| | 5ccdfa0ca6 | | |
| | ff6fda1f04 | | |
| | ca37fca5ce | | |
| | 1bbc511bb7 | | |
| | 603e10a364 | | |
| | 7277bdc27f | | |
| | b40b832159 | | |
| | 28e9534765 | | |
| | 46ae92b5c1 | | |
| | 410bfe7065 | | |
| | b3912fe0ce | | |
| | 61e07f4318 | | |
| | 51002d4502 | | |
| | fb7828b52b | | |
| | 2f1965733f | | |
| | 267742c7d7 | | |
| | 4e8968f9c7 | | |
| | f4a8db93e4 | | |
| | a5a3e223dc | | |
| | 2349de518b | | |
| | 65bac4ebfe | | |
| | 96bf32614f | | |
| | ae33cce889 | | |
| | c5c080b3e3 | | |
| | 01b7758fe6 | | |
| | 7742bda245 | | |
| | 98fe1f1846 | | |
| | beb8b5cbaa | | |
| | 898deda05f | | |
| | f34399a30d | | |
| | 9b39581b53 | | |
| | ae7446a04b | | |
| | f21be4f4d4 | | |
| | 8fb4d3d634 | | |
| | 35e57cc789 | | |
| | b02c8bb50e | | |
| | dc483ae31a | | |
| | 9d2f748557 | | |
| | 8a12b7ff17 | | |
| | f65698925e | | |
| | 9f20dcae05 | | |
| | b7251ac416 | | |
| | 807b097eb4 | | |
| | 5754994f8e | | |
| | c299a2cb85 | | |
| | b129f03837 | | |
| | b7faac00c5 | | |
| | 8f305ba3df | | |
| | c9ddfa9ac1 | | |
| | 3233cf07cd | | |
| | ac90acfac8 | | |
| | 12a775c834 | | |
| | 41c05f42b5 | | |
| | e8d6d6d473 | | |
| | 8d0f2379ba | | |
| | 90b2a5d0e9 | | |
| | b726048d41 | | |
| | 533b8e846d | | |
| | f4e6871d76 | | |
| | 793559a4b5 | | |
| | 0cf1106b34 | | |
| | 2029457f57 | | |
| | 8f5b905015 | | |
43
.claude/skills/deploy/SKILL.md
Normal file
|
|
@ -0,0 +1,43 @@
|
|||
---
|
||||
name: deploy
|
||||
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
|
||||
---
|
||||
|
||||
Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
|
||||
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
|
||||
|
||||
## Targets
|
||||
|
||||
| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
|
||||
|
||||
## Invocation
|
||||
|
||||
```bash
scripts/deploy/deploy.sh <target>            # full pipeline
scripts/deploy/deploy.sh <target> --dry-run  # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate  # emergency: bypass tests
```
|
||||
|
||||
## Exit Code Handling
|
||||
|
||||
| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
|
||||
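The invocation and exit-code rules above can be sketched as a small dispatcher. This is an illustration only: the action labels and the function names `build_command` and `required_action` are hypothetical and are not part of `deploy.sh`; only the targets, flags, and exit codes come from the tables above.

```python
TARGETS = {"control-plane", "vps", "piha", "solaria", "chelsty-infra"}

# Exit codes as listed in the table above.
# The action labels are illustrative names, not output of deploy.sh.
EXIT_ACTIONS = {
    0: "report_success",
    1: "fix_upstream_never_bypass",
    2: "show_failed_check_do_not_deploy",
    3: "show_output_ask_investigate_or_rollback",
    4: "show_docker_ps_discuss_rollback",
    5: "print_stderr_verbatim_and_stop",
}


def build_command(target, dry_run=False, no_gate=False, user_requested_bypass=False):
    """Build the argv for the sole entry point, enforcing the --no-gate rule."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    if no_gate and not user_requested_bypass:
        raise ValueError("--no-gate needs an explicit emergency request from the user")
    argv = ["scripts/deploy/deploy.sh", target]
    if dry_run:
        argv.append("--dry-run")
    if no_gate:
        argv.append("--no-gate")
    return argv


def required_action(exit_code, stderr=""):
    """Return (action, message); for exit 5 the message is stderr, unmodified."""
    action = EXIT_ACTIONS.get(exit_code, "unknown_code_stop_and_report")
    message = stderr if exit_code == 5 else ""
    return action, message
```

Treating an unlisted exit code as "stop and report" is an assumption of this sketch; the table does not define that case.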
|
||||
## Rules
|
||||
|
||||
- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
|
||||
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
|
||||
- Canonical branch is `master` — preflight enforces this.
|
||||
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
|
||||
65
.claude/skills/save-session/SKILL.md
Normal file
|
|
@ -0,0 +1,65 @@
|
|||
---
|
||||
name: save-session
|
||||
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
|
||||
---
|
||||
|
||||
**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
|
||||
Never invoke proactively. Never invoke mid-task.
|
||||
|
||||
## 1. Determine Session Boundary
|
||||
|
||||
1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
|
||||
2. Fallback if no previous entry exists: 24 hours ago.
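A minimal sketch of the boundary lookup described above, assuming session files are named `YYYY-MM-DD.md` (as section 3 suggests) and that a latest file without any `## Session HH:MM` heading also falls back to 24 hours. The function name `session_boundary` is hypothetical.

```python
import re
from datetime import datetime, timedelta
from pathlib import Path

# Matches headings such as "## Session 17:40" at the start of a line.
HEADING = re.compile(r"^## Session (\d{2}):(\d{2})[ \t]*$", re.MULTILINE)


def session_boundary(sessions_dir, now):
    """Return the start boundary: last heading of the latest file, else now - 24 h."""
    files = sorted(Path(sessions_dir).glob("????-??-??.md"))
    if files:
        latest = files[-1]  # ISO dates sort chronologically
        found = HEADING.findall(latest.read_text(encoding="utf-8"))
        if found:
            hour, minute = found[-1]
            day = datetime.strptime(latest.stem, "%Y-%m-%d")
            return day.replace(hour=int(hour), minute=int(minute))
    return now - timedelta(hours=24)
```

The result is a timestamp, whereas a `<boundary>..HEAD` range expects a revision. One possible way to bridge the two is to resolve the timestamp to a commit first, for example with `git rev-list -1 --before=<timestamp> HEAD`; whether the skill intends this is not stated in the source.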
|
||||
|
||||
## 2. Collect Facts (deterministic only — no invention)
|
||||
|
||||
Run exactly:
|
||||
```bash
# All commits since boundary
git --no-pager log --oneline <boundary>..HEAD

# Changed file summary
git --no-pager diff --stat <boundary>..HEAD
```
|
||||
|
||||
From the visible conversation transcript, collect the deploys that were run and their outcomes, plus any test results seen.
|
||||
|
||||
## 3. Write the Session Entry
|
||||
|
||||
**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
|
||||
Never overwrite existing content.
|
||||
|
||||
```markdown
## Session HH:MM

### Commits
<output of git log --oneline>

### Files changed
<output of git diff --stat>

### Deploys
<list from transcript, or "None recorded">

### Narrative
> _user-provided summary_
```
|
||||
|
||||
The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
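The append-only rule and the mandatory placeholder could be implemented roughly as follows. This is a sketch under assumptions: `render_entry` and `append_entry` are hypothetical helper names, and the inputs are the raw outputs collected in section 2.

```python
from pathlib import Path

PLACEHOLDER = "> _user-provided summary_"


def render_entry(hhmm, commits, diffstat, deploys):
    """Render one session entry; the narrative is always the untouched placeholder."""
    deploy_text = "\n".join(deploys) if deploys else "None recorded"
    return (
        f"## Session {hhmm}\n\n"
        f"### Commits\n{commits.strip()}\n\n"
        f"### Files changed\n{diffstat.strip()}\n\n"
        f"### Deploys\n{deploy_text}\n\n"
        f"### Narrative\n{PLACEHOLDER}\n"
    )


def append_entry(path, entry):
    """Append to the day's file, creating it if needed; never rewrite existing text."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    prefix = ""
    if existing:
        # Keep one blank line between the previous entry and the new one.
        prefix = "\n" if existing.endswith("\n") else "\n\n"
    with path.open("a", encoding="utf-8") as handle:
        handle.write(prefix + entry)
```

Opening the file in append mode is what guarantees that earlier entries are left byte-for-byte intact.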
|
||||
|
||||
## 4. What NOT to Touch
|
||||
|
||||
- `backlog.md` — only on explicit "update backlog" instruction
|
||||
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
|
||||
- Any other file not listed above
|
||||
|
||||
## 5. Commit
|
||||
|
||||
Stage and commit **only** the session file:
|
||||
|
||||
```bash
git add docs/sessions/YYYY-MM-DD.md
git commit -m "docs: session YYYY-MM-DD HH:MM"
```
|
||||
|
||||
No other files. No `git add -A`.
|
||||
81
.claude/skills/worktree-aware/SKILL.md
Normal file
|
|
@ -0,0 +1,81 @@
|
|||
---
|
||||
name: worktree-aware
|
||||
description: >
|
||||
Use when working in a git worktree checkout for a parallel agent task.
|
||||
The presence of an .agent-task file in the current working directory indicates
|
||||
a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
|
||||
to the assigned task branch, NEVER push origin master, NEVER touch the main
|
||||
checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
|
||||
completion, report the branch name verbatim and stop — the human merges via
|
||||
scripts/dev/agent.sh.
|
||||
---
|
||||
|
||||
## When this applies
|
||||
|
||||
- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
|
||||
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
|
||||
In the main checkout these rules do not apply.
|
||||
|
||||
## Reading the marker
|
||||
|
||||
`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
|
||||
|
||||
```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```
|
||||
|
||||
Always read this file first before taking any action.
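As an illustration, reading the marker and checking the current branch against it might look like this. The sketch assumes the marker stays a flat `key: value` file like the example above, so it avoids a YAML dependency; `read_agent_task`, `current_branch`, and `on_assigned_branch` are hypothetical names.

```python
import subprocess
from pathlib import Path


def read_agent_task(cwd):
    """Return the marker's fields as a dict, or None when .agent-task is absent."""
    marker = Path(cwd) / ".agent-task"
    if not marker.is_file():
        return None  # main checkout: the worktree rules do not apply
    fields = {}
    for line in marker.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so timestamps keep their colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def current_branch(cwd):
    """Ask git for the checked-out branch name (requires git 2.22 or newer)."""
    result = subprocess.run(
        ["git", "branch", "--show-current"],
        cwd=cwd, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def on_assigned_branch(cwd, branch_lookup=current_branch):
    """True only when the checked-out branch equals the marker's `branch:` value."""
    task = read_agent_task(cwd)
    if task is None:
        raise RuntimeError("no .agent-task marker: this is not a task worktree")
    return branch_lookup(cwd) == task.get("branch")
```

The `branch_lookup` parameter exists only so the check can be exercised without a real repository.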
|
||||
|
||||
## Rules
|
||||
|
||||
1. **Commit only to your branch.**
|
||||
Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
|
||||
If it does not, stop immediately and report the discrepancy.
|
||||
|
||||
2. **Push only to your branch.**
|
||||
The only permitted push is `git push origin task/<name>`.
|
||||
NEVER `git push origin master` or any other branch.
|
||||
|
||||
3. **Do not touch the main checkout.**
|
||||
`~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
|
||||
Do not read from, write to, or execute commands inside it.
|
||||
|
||||
4. **Stay scoped.**
|
||||
Only change files directly related to your assigned task.
|
||||
If you notice other problems, report them in your final summary as separate follow-up proposals.
|
||||
Do not fix them in this worktree.
|
||||
|
||||
5. **Never `git add -A`.**
|
||||
Always stage specific files by name: `git add path/to/file`.
|
||||
|
||||
6. **Do not manage worktrees.**
|
||||
Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
|
||||
Worktree lifecycle is the human's responsibility.
|
||||
|
||||
7. **Final report before stopping.**
|
||||
When the task is done, provide a structured report containing:
|
||||
- Files changed (path and one-line summary of change)
|
||||
- Tests run and results
|
||||
- All commit hashes on the task branch
|
||||
- **Branch name verbatim** (copy-paste ready)
|
||||
- Follow-up items as bulleted proposals for separate tasks
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
|
||||
- Test suite passes
|
||||
- Branch pushed: `git push origin task/<name>`
|
||||
- Full report delivered in conversation
|
||||
|
||||
## What you do NOT do
|
||||
|
||||
- Merge branches
|
||||
- Create or push tags
|
||||
- Run deploys or healthchecks against production nodes
|
||||
- Delete branches or worktrees
|
||||
- Modify files in other worktrees
|
||||
- Push to `origin master` under any circumstances
|
||||
1
.gitignore
vendored
|
|
@ -15,6 +15,7 @@ __pycache__/
|
|||
*$py.class
|
||||
venv/
|
||||
.venv/
|
||||
*.egg-info/
|
||||
|
||||
# Tools
|
||||
.aider*
|
||||
|
|
|
|||
194
CLAUDE.md
Normal file
|
|
@ -0,0 +1,194 @@
|
|||
# CLAUDE.md
|
||||
|
||||
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
||||
|
||||
## What This Repo Is
|
||||
|
||||
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
|
||||
|
||||
## Node Roles
|
||||
|
||||
| Host | Role |
|------|------|
| **SATURN** | Primary control node; the only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent; offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker; offline-first |
|
||||
|
||||
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
|
||||
|
||||
## Deployment
|
||||
|
||||
```bash
scripts/deploy/deploy.sh                          # fresh deploy on current node
scripts/deploy/deploy.sh --resume                 # resume after interruption
scripts/deploy/deploy.sh --stage verify           # specific stage only
scripts/deploy/deploy.sh --service mosquitto      # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh    # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra     # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh               # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh            # CHELSTY-specific bootstrap
```
|
||||
|
||||
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state is persisted in `/opt/homelab/state/deploy/`.
|
||||
|
||||
## Service Structure
|
||||
|
||||
Every service must follow this layout:
|
||||
|
||||
```
services/<service>/
├── docker-compose.yml
├── service.yaml         # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example          # Template; never commit actual secrets
└── healthcheck.sh       # Returns 0 (healthy) or 1 (unhealthy)
```
|
||||
|
||||
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
|
||||
|
||||
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
|
||||
|
||||
## Agent System Architecture
|
||||
|
||||
The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
|
||||
|
||||
1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
|
||||
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
|
||||
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
|
||||
4. **Executor** — Executes actions only after they transition to `approved`.
|
||||
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
|
||||
|
||||
### Action approval flow
|
||||
```
Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed
```
|
||||
|
||||
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
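The approval gate can be pictured with a small file-based sketch. It assumes that approval is represented by moving `<id>.json` from `pending/` to `approved/`, as the flow above suggests; the function names and the action fields used in the usage example are hypothetical, and the real action schema is not shown in this document.

```python
import json
from pathlib import Path


def propose(actions_dir, action_id, action):
    """An agent writes a proposal; nothing is executed at this point."""
    pending = Path(actions_dir) / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    path = pending / f"{action_id}.json"
    path.write_text(json.dumps(action), encoding="utf-8")
    return path


def approve(actions_dir, action_id):
    """Operator step: move the action file from pending/ to approved/."""
    source = Path(actions_dir) / "pending" / f"{action_id}.json"
    approved = Path(actions_dir) / "approved"
    approved.mkdir(parents=True, exist_ok=True)
    return source.rename(approved / source.name)


def runnable_actions(actions_dir):
    """Executor view: only files under approved/ are eligible to run."""
    approved = Path(actions_dir) / "approved"
    if not approved.is_dir():
        return []
    return [json.loads(p.read_text(encoding="utf-8")) for p in sorted(approved.glob("*.json"))]
```

Because the executor only ever lists `approved/`, a proposal that was never approved cannot be picked up by accident.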
|
||||
|
||||
## Event System
|
||||
|
||||
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
|
||||
|
||||
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
|
||||
|
||||
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
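For orientation, appending one event to the path layout above might look like the following. This is not the API of `scripts/lib/events.py`, which is not shown here; `emit_event` and the record fields `ts`, `node`, `type`, and `payload` are assumptions made for the example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def emit_event(events_root, node, event_type, payload=None, now=None):
    """Append one JSON line to <root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    path = Path(events_root) / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": now.isoformat(),
        "node": node,
        "type": event_type,
        "payload": payload or {},
    }
    # Append mode keeps the store append-only: earlier lines are never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```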
|
||||
|
||||
### Supervisor event routing table
|
||||
|
||||
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | n/a |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
|
||||
|
||||
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
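The routing table and the 5-minute suppression rule can be sketched as a lookup. This is a simplified illustration, not the supervisor's code: `route` is a hypothetical function, `cancel_restart` is a made-up label for "cancels matching restart", and cooldowns, deduplication, and the severity condition on `disk_pressure` are not modelled.

```python
from datetime import datetime, timedelta

HA_SUPPRESSION_WINDOW = timedelta(minutes=5)

ROUTES = {
    "containers_not_running": "container_restart",
    "mqtt_unreachable": "container_restart",
    "disk_pressure": "disk_cleanup",  # the table restricts this to high severity
    "ha_websocket_dead": "container_restart",
    "ha_websocket_recovered": "cancel_restart",
}

ALERT_ONLY = {
    "ha_integration_failed",
    "ha_entity_unavailable_long",
    "ha_automation_failing",
    "ha_update_available",
    "ha_recorder_lag",
    "ha_system_health_degraded",
}


def route(event_type, now, last_ha_container_incident=None):
    """Return the action for an event, or None when an HA event is suppressed."""
    if event_type.startswith("ha_") and last_ha_container_incident is not None:
        # A recent containers_not_running incident on homeassistant suggests
        # a planned restart or update, so HA events are dropped.
        if now - last_ha_container_incident <= HA_SUPPRESSION_WINDOW:
            return None
    if event_type in ALERT_ONLY:
        return "alert_only"
    # `service_unhealthy` and other stability-agent events fall through to redeploy.
    return ROUTES.get(event_type, "redeploy")
```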
|
||||
|
||||
## Discovery Entry Points for Agents
|
||||
|
||||
When exploring the system, use these files in order:
|
||||
1. `inventory/topology.yaml` — node list, roles, mesh type
|
||||
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
|
||||
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
|
||||
4. `services/<service>/service.yaml` — operational contract for a service
|
||||
|
||||
## VPS-Specific Rules
|
||||
|
||||
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
|
||||
|
||||
### Memory limit convention
|
||||
|
||||
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
|
||||
|
||||
```yaml
services:
  myservice:
    mem_limit: 256m        # cgroup ceiling; the container is OOM-killed on breach and restarted per its restart policy
    oom_score_adj: -900    # makes the host kernel OOM killer very unlikely to pick this container
```
|
||||
|
||||
Rules:
|
||||
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
|
||||
- `mem_limit` still applies even with `oom_score_adj: -900`. The cgroup OOM killer is independent of the host OOM killer: when the limit is exceeded, the container's process is killed and Docker restarts the container according to its restart policy.
|
||||
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
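The budget rule lends itself to a quick arithmetic check. The sketch below is illustrative: `parse_mem_limit_mib` and `check_budget` are hypothetical helpers, and the service names and limits in the usage example are invented, not taken from the repo's override files.

```python
BUDGET_MIB = 3200  # ceiling for the sum of all mem_limit values

# Multipliers that convert a suffixed value to MiB.
UNITS_MIB = {"k": 1 / 1024, "m": 1, "g": 1024}


def parse_mem_limit_mib(value):
    """Convert a Compose byte value such as '256m', '1g' or '512mb' to MiB."""
    text = str(value).strip().lower()
    if text.endswith("b"):
        text = text[:-1]  # '512mb' -> '512m', '1024b' -> '1024'
    if text and text[-1] in UNITS_MIB:
        return float(text[:-1]) * UNITS_MIB[text[-1]]
    return float(text) / (1024 * 1024)  # a bare number is a byte count


def check_budget(limits):
    """Return (total MiB, within budget) for a {service: mem_limit} mapping."""
    total = sum(parse_mem_limit_mib(v) for v in limits.values())
    return total, total <= BUDGET_MIB
```

For example, limits of `256m`, `1g`, and `512m` add up to 1792 MiB and fit the budget, while `3g` plus `256m` comes to 3328 MiB and does not.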
|
||||
|
||||
### Repo-managed services on VPS
|
||||
|
||||
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
|
||||
|
||||
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
|
||||
|
||||
**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
|
||||
|
||||
**Cutover checklist** (before running `docker compose up` for any migrated service):
|
||||
1. `git pull` on VPS
|
||||
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
|
||||
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
|
||||
4. For mosquitto: config stays at old bind path until explicitly migrated
|
||||
5. Verify named volumes exist: `docker volume ls | grep <project>`
|
||||
|
||||
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.
|
||||
|
||||
## CHELSTY-Specific Rules
|
||||
|
||||
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
|
||||
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
|
||||
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
|
||||
|
||||
## Runtime Path Conventions
|
||||
|
||||
`/opt/homelab/` layout on each node:
|
||||
|
||||
- `data/<service>/` — persistent volumes
|
||||
- `config/<service>/` — secrets and host-local overrides (not in Git)
|
||||
- `logs/<service>/` — service logs
|
||||
- `state/` — deployment stage markers, agent heartbeats
|
||||
- `events/` — append-only event store
|
||||
- `world/` — Observer output (synthesized state)
|
||||
- `actions/` — pending / approved / running / completed / failed
|
||||
|
||||
## Definition of Done (services)
|
||||
|
||||
Before any new or changed service is considered ready:
|
||||
|
||||
1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
|
||||
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
|
||||
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
|
||||
|
||||
## Naming Conventions
|
||||
|
||||
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
|
||||
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
|
||||
- Container names must match service names
|
||||
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
|
||||
|
||||
## Multi-agent worktree mode
|
||||
|
||||
`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
|
||||
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
|
||||
|
||||
If `.agent-task` exists in your current working directory, you are in a task worktree.
|
||||
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
|
||||
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
|
||||
|
||||
Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
|
||||
Agents never invoke these — only the human does.
|
||||
19
README.md
|
|
@ -13,6 +13,22 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
|
|||
| **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
|
||||
| **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
|
||||
|
||||
## Agent System
|
||||
|
||||
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
|
||||
|
||||
| Agent | Node | Role |
|-------|------|------|
| **stability-agent** | all nodes | Per-node watchdog: monitors Docker, disk, Tailscale, MQTT; emits events |
| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
| **planner-agent** | SOLARIA | LLM-powered diagnosis: listens to Redis, proposes remediation actions |
| **executor** | VPS | Executes actions only after operator approval |
| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
|
||||
|
||||
Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
|
||||
|
||||
## Repository Structure
|
||||
|
||||
- `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
|
||||
|
|
@ -29,10 +45,13 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
|
|||
## Documentation Index
|
||||
|
||||
- [Infrastructure Standards](docs/standards.md)
|
||||
- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
|
||||
- [Deployment Conventions](docs/deployment.md)
|
||||
- [Hardware](docs/hardware.md)
|
||||
- [Networking](docs/networking.md)
|
||||
- [Services](docs/services.md)
|
||||
- [Node Capabilities](docs/capabilities.md)
|
||||
- [Action Model](services/agent-system/action-model.md)
|
||||
|
||||
---
|
||||
*Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
|
||||
|
|
|
|||
31
backups/zigbee/coordinator_backup.json
Normal file
|
|
@ -0,0 +1,31 @@
|
|||
{
|
||||
"metadata": {
|
||||
"format": "zigpy/open-coordinator-backup",
|
||||
"version": 1,
|
||||
"source": "zigbee-herdsman@10.0.7",
|
||||
"internal": {
|
||||
"date": "2026-05-14T14:48:35.098Z",
|
||||
"znpVersion": 1
|
||||
}
|
||||
},
|
||||
"stack_specific": {
|
||||
"zstack": {
|
||||
"tclk_seed": "32d69cbe3f0e15471e5d43f9401e485a"
|
||||
}
|
||||
},
|
||||
"coordinator_ieee": "00124b00257bf416",
|
||||
"pan_id": "46bc",
|
||||
"extended_pan_id": "087730b5f614ea4a",
|
||||
"nwk_update_id": 0,
|
||||
"security_level": 5,
|
||||
"channel": 11,
|
||||
"channel_mask": [
|
||||
11
|
||||
],
|
||||
"network_key": {
|
||||
"key": "049909949a950d91522cf10cc369a724",
|
||||
"sequence_number": 0,
|
||||
"frame_counter": 0
|
||||
},
|
||||
"devices": []
|
||||
}
|
||||
49
docs/agents.md
Normal file
|
|
@ -0,0 +1,49 @@
|
|||
# Agent Operating Procedures
|
||||
|
||||
This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
|
||||
|
||||
## 1. Core Principles for Agents
|
||||
|
||||
1. **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
|
||||
2. **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
|
||||
3. **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
|
||||
4. **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
|
||||
5. **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
|
||||
|
||||
## 2. Agent Roles
|
||||
|
||||
| Role | Responsibility | Scope |
|------|----------------|-------|
| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
|
||||
|
||||
## 3. Discovery Protocol
|
||||
|
||||
Agents must use the following entry points to understand the system:
|
||||
|
||||
1. **Topology**: `inventory/topology.yaml` for node list and roles.
|
||||
2. **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
|
||||
3. **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
|
||||
4. **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
|
||||
|
||||
## 4. Interaction with Humans
|
||||
|
||||
Agents communicate with the operator via the `agent-system/telegram-bot`.
|
||||
|
||||
- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
|
||||
- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
|
||||
- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
|
||||
|
||||
## 5. Decision Logic (Reasoning)
|
||||
|
||||
When making decisions, agents MUST prioritize:
|
||||
1. **Safety**: Do not violate power constraints (see `capabilities.yaml`).
|
||||
2. **Stability**: Prefer keeping services on their `owner_node` unless it's down.
|
||||
3. **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
|
||||
|
||||
## 6. Access Control for Agents
|
||||
|
||||
- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
|
||||
- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.
|
||||
|
|
@ -83,3 +83,10 @@ Future autonomous agents will use this metadata to:
|
|||
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
|
||||
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
|
||||
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
|
||||
|
||||
## Agent Reasoning Logic
|
||||
|
||||
When an agent parses `capabilities.yaml`, it should apply these heuristics:
|
||||
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
|
||||
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
|
||||
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
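These heuristics could be applied to a parsed `capabilities.yaml` along the following lines. The sketch assumes the file has already been loaded into a dict, that `availability_target` sits at the top level as written above, and that the returned flag names (which are hypothetical) are consumed by some planner.

```python
def placement_constraints(capabilities):
    """Translate capability fields into scheduling flags for one node."""
    operational = capabilities.get("operational", {})
    flags = []
    if operational.get("connectivity") == "intermittent":
        # No high-bandwidth syncs or critical cloud-dependent services.
        flags.append("no_high_bandwidth_or_cloud_dependent")
    if operational.get("power_constraint") == "low-power":
        # No heavy LLM inference or continuous high-CPU tasks.
        flags.append("no_heavy_inference_or_sustained_cpu")
    if capabilities.get("availability_target") == "high":
        flags.append("control_plane_failover_candidate")
    return flags
```

A node with intermittent connectivity and a low-power constraint would receive the first two flags and would not be offered as a failover candidate unless its availability target is also `high`.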
|
||||
|
|
|
|||
|
|
@ -1,60 +1,154 @@
|
|||
# CHELSTY Runtime
|
||||
|
||||
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
|
||||
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
|
||||
|
||||
| Node | Role | Services |
|------|------|----------|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent; see below) |
|
||||
|
||||
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
|
||||
|
||||
## Runtime Layout
|
||||
|
||||
The CHELSTY runtime is located at `/opt/homelab`.
|
||||
|
||||
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
|
||||
- `/opt/homelab/data/`: Persistent data for services.
|
||||
- `/opt/homelab/logs/`: Service logs.
|
||||
|
||||
### Key Service Locations
|
||||
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
|
||||
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
|
||||
```
/opt/homelab/
├── config/              # Service-specific configs and secrets (not in Git)
│   ├── mosquitto/
│   └── zigbee2mqtt/
├── data/                # Persistent service data
│   ├── mosquitto/       # Persistence DB, password file
│   └── zigbee2mqtt/
│       └── data/        # z2m config, coordinator backup, network key
└── logs/
```
|
||||
|
||||
## SLZB-06U Integration
|
||||
|
||||
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
|
||||
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
|
||||
|
||||
- **Coordinator IP**: 192.168.1.105
|
||||
- **Port**: 6638
|
||||
- **Protocol**: TCP (ezsp adapter)
|
||||
- **Coordinator IP**: `192.168.1.105`
|
||||
- **Port**: `6638`
|
||||
- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
|
||||
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
|
||||
|
||||
Zigbee2MQTT is configured to connect to this coordinator over the local network.
|
||||
⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
|
||||
|
||||
## Offline & LTE Assumptions
|
||||
## Networking Constraints
|
||||
|
||||
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
|
||||
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
|
||||
- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.
|
||||
### Mosquitto — `network_mode: host`
|
||||
Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
|
||||
|
||||
### Zigbee2MQTT — bridge network + extra_hosts
|
||||
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
|
||||
|
||||
```yaml
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
services:
  zigbee2mqtt:
    extra_hosts:
      - "mosquitto:host-gateway"
```
|
||||
|
||||
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
|
||||
|
||||
**Why not `network_mode: host` for z2m?**
|
||||
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
|
||||
|
||||
## Zigbee2MQTT Config Location
|
||||
|
||||
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
|
||||
|
||||
```
|
||||
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
||||
```
|
||||
|
||||
This path is mounted read-write by the base `docker-compose.yml`:
|
||||
```yaml
volumes:
  - /opt/homelab/data/zigbee2mqtt/data:/app/data
```
|
||||
|
||||
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
|
||||
|
||||
### Minimal configuration.yaml
|
||||
```yaml
homeassistant: true
permit_join: false
mqtt:
  base_topic: zigbee2mqtt
  server: mqtt://mosquitto:1883
serial:
  port: tcp://192.168.1.105:6638
  adapter: ezsp
frontend:
  port: 8080
advanced:
  log_level: info
```
|
||||
|
||||
## chelsty-ha — No node-agent
|
||||
|
||||
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
|
||||
|
||||
In `hosts/chelsty-ha/services.yaml`:
|
||||
```yaml
services:
  homeassistant:
    monitor: false  # No node-agent; suppresses supervisor action generation
```
|
||||
|
||||
Remove `monitor: false` once node-agent is bootstrapped on this VM.
|
||||
|
||||
## Deployment Flow
|
||||
|
||||
1. **Initial Bootstrap**:
|
||||
Run the bootstrap script on the CHELSTY node:
|
||||
```bash
|
||||
./scripts/bootstrap/chelsty-runtime.sh
|
||||
```
|
||||
### Initial Bootstrap
|
||||
```bash
|
||||
./scripts/bootstrap/chelsty-runtime.sh
|
||||
```
|
||||
|
||||
2. **Manual Configuration**:
|
||||
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
|
||||
- Add Mosquitto user:
|
||||
```bash
|
||||
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
|
||||
```
|
||||
### Deploy services
|
||||
```bash
|
||||
./scripts/deploy/deploy-node.sh chelsty-infra
|
||||
./scripts/deploy/deploy-node.sh chelsty-ha
|
||||
```
|
||||
|
||||
3. **Service Deployment**:
|
||||
Use the staged deployment runtime:
|
||||
```bash
|
||||
./scripts/deploy/deploy-node.sh chelsty
|
||||
```
|
||||
### Manual (SSH) — chelsty-infra uses docker-compose v1
|
||||
```bash
ssh oskar@100.122.201.22
cd ~/homelab-codex-ws/services/<service>
docker-compose -f docker-compose.yml \
  -f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
  up -d --build --force-recreate
```
|
||||
|
||||
## Recovery Procedure
|
||||
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
|
||||
|
||||
In case of runtime failure:
|
||||
1. Verify Docker and Compose plugin: `docker compose version`
|
||||
2. Re-run bootstrap script to ensure directory structure and basic configs.
|
||||
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
|
||||
4. Verify SLZB-06U reachability: `ping 192.168.1.105`
|
||||
## Recovery Procedures
|
||||
|
||||
### Mosquitto stopped
|
||||
```bash
ssh oskar@100.122.201.22 "docker start mosquitto"
# Ensure restart policy is correct (the container lives on chelsty-infra, so run this over SSH too):
ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
```
|
||||
|
||||
### Zigbee2MQTT won't start
|
||||
1. Check logs: `docker logs zigbee2mqtt --tail 50`
|
||||
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
|
||||
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
|
||||
4. If config missing, recreate from the minimal template above
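
Step 2 can also be scripted where `nc` is not installed. The following is a minimal sketch of the same TCP reachability probe in Python; the function name and the 3-second default timeout are assumptions chosen for illustration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False

# Example call for the coordinator described above:
#   tcp_reachable("192.168.1.105", 6638)
```

A `False` result here corresponds to the "SLZB-06U unreachable" case: the coordinator or the LAN is down, and Zigbee2MQTT itself is not at fault.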
|
||||
|
||||
### SLZB-06U unreachable
|
||||
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
|
||||
|
||||
## Critical Backup Sets
|
||||
|
||||
| Data | Path |
|------|------|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
|
||||
|
||||
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
|
||||
|
|
|
|||
42
docs/chelsty-stability-agent.md
Normal file
|
|
@ -0,0 +1,42 @@
|
|||
### CHELSTY Stability Agent
|
||||
|
||||
The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.
|
||||
|
||||
#### Purpose
|
||||
|
||||
It acts as a filesystem-first watchdog that detects anomalies in the local runtime environment without taking autonomous destructive actions (like restarts). It serves as the primary data source for node-level stability metrics.
|
||||
|
||||
#### Monitoring Scope
|
||||
|
||||
* **Docker Containers**: Monitors all local containers. If a container is not in the `running` state, a `containers_not_running` event is generated.
|
||||
* **Disk Usage**: Monitors the root filesystem. Generates `disk_usage_high` events if usage exceeds the configured threshold.
|
||||
* **Connectivity**:
|
||||
* Checks if the Tailscale socket or interface is available.
|
||||
* Checks reachability of the local Mosquitto MQTT broker.
|
||||
* **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
|
||||
|
||||
#### Storage and Integration
|
||||
|
||||
* **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
|
||||
* **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
|
||||
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
|
||||
|
||||
#### Deployment
|
||||
|
||||
The service is deployed via Docker Compose on CHELSTY.
|
||||
|
||||
```bash
cd services/stability-agent
# chelsty-infra only has docker-compose v1 (hyphenated); `docker compose` (v2) is not available there
docker-compose up -d
```
|
||||
|
||||
#### Configuration
|
||||
|
||||
Configuration is managed via environment variables in `docker-compose.override.yml` on the host.
|
||||
|
||||
| Variable | Description | Default |
|----------|-------------|---------|
| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
| `DISK_THRESHOLD_PCT` | Disk usage alert threshold | `90` |
| `MQTT_HOST` | MQTT broker hostname | `mosquitto` |
| `MQTT_PORT` | MQTT broker port | `1883` |
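
As an illustration of how these variables and defaults might be consumed, here is a minimal sketch. The `AgentConfig` class and `load_config` function are hypothetical names, not taken from the stability-agent source; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    check_interval_s: int
    disk_threshold_pct: int
    mqtt_host: str
    mqtt_port: int


def load_config(env=None) -> AgentConfig:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return AgentConfig(
        check_interval_s=int(env.get("STABILITY_CHECK_INTERVAL", "60")),
        disk_threshold_pct=int(env.get("DISK_THRESHOLD_PCT", "90")),
        mqtt_host=env.get("MQTT_HOST", "mosquitto"),
        mqtt_port=int(env.get("MQTT_PORT", "1883")),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.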
|
||||
98
docs/observer-runtime.md
Normal file
|
|
@ -0,0 +1,98 @@
|
|||
# Observer Runtime
|
||||
|
||||
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
|
||||
|
||||
## Architecture
|
||||
|
||||
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
||||
|
||||
### Inputs
|
||||
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
|
||||
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
|
||||
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
||||
|
||||
### World Model Output
|
||||
Generated under `/opt/homelab/world/`:
|
||||
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
|
||||
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
|
||||
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
||||
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
||||
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
||||
|
||||
## Checkpoint Format
|
||||
|
||||
The observer tracks per-node progress to avoid silently skipping event directories:
|
||||
|
||||
```json
{
  "node_checkpoints": {
    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}
```
|
||||
|
||||
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
|
||||
|
||||
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
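
The difference between the two checkpoint schemes can be sketched as follows. This is an illustration, not the observer's code: it assumes event paths compare lexicographically in processing order within a node, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath


def node_of(path: str) -> str:
    # Layout from this document: /opt/homelab/events/<date>/<node>/<file>.json
    return PurePosixPath(path).parent.name


def pending_events(all_paths, node_checkpoints):
    """Return unprocessed event paths, comparing each path only to its own node's checkpoint."""
    pending = []
    for path in sorted(all_paths):
        last = node_checkpoints.get(node_of(path))
        if last is None or path > last:
            pending.append(path)
    return pending


def pending_events_global(all_paths, last_processed_file):
    """The old scheme: one checkpoint for all nodes (shown only to illustrate the bug)."""
    return [p for p in sorted(all_paths) if p > last_processed_file]
```

With a global checkpoint pointing into `vps/`, a new file under `piha/` sorts before it and is never returned; the per-node version picks it up.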
|
||||
|
||||
## Event Types
|
||||
|
||||
### Negative events (create/escalate incidents)
|
||||
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
|
||||
- `deployment_failed` — record failure in deployments.json
|
||||
|
||||
### Positive events (resolve state)
|
||||
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
|
||||
- `service_recovered` — alias, same effect
|
||||
- `deployment_completed` — marks deployment as completed
|
||||
|
||||
### Node events
|
||||
- `node_online`, `node_offline` — update node status in nodes.json
|
||||
- `disk_pressure_*` — set `disk_pressure` field on the node record
|
||||
|
||||
## Incident Lifecycle
|
||||
|
||||
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
|
||||
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
|
||||
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
|
||||
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
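
The four steps above can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the `active` status value, the event dict shape, and the function names are hypothetical; the incident ID format mirrors the example in this document.

```python
NEGATIVE = ("service_unhealthy", "healthcheck_failed")
POSITIVE = ("service_healthy", "service_recovered")


def apply_event(incidents: dict, event: dict) -> dict:
    """Apply one event to the incidents dict (mutates and returns it)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    active = next(
        (i for i in incidents.values()
         if i["node"] == node and i["service"] == service and i["status"] == "active"),
        None,
    )
    if event["type"] in NEGATIVE:
        if active is None:  # Detection: open a new incident
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service, "status": "active",
                "started_at": ts, "last_occurrence": ts, "occurrence_count": 1,
            }
        else:  # Correlation: collapse into the existing incident
            active["occurrence_count"] += 1
            active["last_occurrence"] = ts
    elif event["type"] in POSITIVE and active is not None:  # Resolution
        active["status"] = "resolved"
        active["resolved_at"] = ts
    return incidents


def prune_expired(incidents: dict, now: float, max_age_s: float = 7 * 24 * 3600) -> dict:
    """Expiry: drop resolved incidents older than max_age_s (7 days by default)."""
    return {
        k: v for k, v in incidents.items()
        if not (v["status"] == "resolved" and now - v["resolved_at"] > max_age_s)
    }
```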
|
||||
|
||||
### Example Incident JSON
|
||||
```json
{
  "inc-1715518800-vps-observer": {
    "id": "inc-1715518800-vps-observer",
    "node": "vps",
    "service": "observer",
    "status": "resolved",
    "severity": "error",
    "started_at": 1715518800.0,
    "last_occurrence": 1715518860.0,
    "occurrence_count": 2,
    "trigger_type": "containers_not_running",
    "resolved_at": 1715519100.0
  }
}
```
|
||||
|
||||
## World State Pruning
|
||||
|
||||
`_prune_stale_world()` runs every reconcile cycle and removes:
|
||||
|
||||
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
|
||||
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
|
||||
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
|
||||
4. **Expired incidents** — resolved incidents older than 7 days.
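
Rules 1 to 3 can be sketched for `services.json` as below. This is not the body of `_prune_stale_world()`; the function name, the regex, and the assumption that the ghost prefix is lowercase hex are illustrative.

```python
import re

# <12hexchars>_<name>, e.g. "0123456789ab_mosquitto" (lowercase hex is an assumption)
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")


def prune_services(services: dict, topology_nodes: set) -> dict:
    """Keep only services of known nodes whose service name is not a ghost key."""
    kept = {}
    for key, record in services.items():
        node, _, service = key.partition("/")
        if node not in topology_nodes:
            continue  # stale node, so its services go too
        if GHOST_SERVICE.match(service):
            continue  # ghost service key
        kept[key] = record
    return kept
```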
|
||||
|
||||
## Runtime Behavior
|
||||
|
||||
### Idempotency
|
||||
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
|
||||
|
||||
### Deployment Tracking
|
||||
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
|
||||
|
||||
### Topology Filtering
|
||||
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
|
||||
234
docs/sessions/2026-05-27-planner-agent.md
Normal file
|
|
@ -0,0 +1,234 @@
|
|||
# SESSION: Building planner-agent, LLM-based diagnostics

**DATE:** 2026-05-27
**RESULT:** planner-agent is running on SOLARIA (`healthy`); Ollama is primary, cloud fallback is ready to be enabled
|
||||
|
||||
---
|
||||
|
||||
## What was built

### `services/planner-agent/src/llm_router.py`

LLM routing module with a local-first fallback chain:

- **`LLMRouter`**: main routing class, built on litellm
- **`ModelConfig`**: configuration of a single model (name, timeout, api_base, extra_kwargs)
- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`); success_rate
- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with bare JSON

Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)

Metrics for every call are published to the Redis channel `llm_router_metrics`.
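
The fence-extraction fallback can be sketched as follows. This is an illustration of the idea, not the actual `_extract_json_from_fence()`; the function name `extract_json` and the regex are assumptions.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str):
    """Parse text as JSON; if that fails, parse the first fenced block instead."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = _FENCED.search(text)
        if match is None:
            raise  # neither bare JSON nor a fenced block
        return json.loads(match.group(1))
```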
|
||||
|
||||
### `services/planner-agent/src/planner.py`

Main agent loop:

- **`PlannerAgent`**: async agent: Redis sub → LLM diagnosis → pending action file → event
- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT record a cooldown if the LLM chain failed
- **`parse_event()`**: normalizes two input formats (node-agent / control-plane)
- **`write_pending_action()`**: atomic write: `.tmp` → rename
- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)

Pipeline:
```
Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
          → write_pending_action() → emit_event("remediation_started")
```
|
||||
|
||||
### Companion files

| File | Description |
|------|------|
| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
| `requirements.txt` | litellm, redis, jsonschema, structlog |
| `tests/test_planner.py` | 49 unit tests |
| `tests/test_llm_router.py` | 34 unit tests |
|
||||
|
||||
---
|
||||
|
||||
## Key architectural decisions

### 1. HITL invariant (Human-in-the-loop)

The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
The planner never executes an action on its own; this is a fundamental rule of the system.

Implementation: `write_pending_action()` writes to `pending/`, and no code path touches `approved/`.
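
The `.tmp` → rename pattern used for pending actions can be sketched like this. It is a minimal illustration, not the real `write_pending_action()`: the function name, the file naming, and the `fsync` call are assumptions.

```python
import json
import os
from pathlib import Path


def write_pending_action_sketch(pending_dir, action_id: str, action: dict) -> Path:
    """Write an action file so readers never observe a half-written JSON document."""
    pending = Path(pending_dir)
    pending.mkdir(parents=True, exist_ok=True)
    final = pending / f"{action_id}.json"
    tmp = pending / f"{action_id}.json.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(action, fh)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
    os.replace(tmp, final)  # atomic when tmp and final are on the same filesystem
    return final
```

Because the function only ever takes a directory it is told to write into, the HITL invariant reduces to never passing it anything other than `pending/`.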
|
||||
|
||||
### 2. Cooldown gate

Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with repeated
proposals for the same service.

**Key decision:** the cooldown is NOT recorded if the entire LLM chain failed.
That way the next event can retry, instead of being silently blocked
for 5 minutes even though no proposal was produced.
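
A minimal sketch of this gate follows. The class and function names are hypothetical and the injected clock exists only to make the behaviour testable; the real `CooldownTracker` may differ.

```python
import time


class CooldownSketch:
    def __init__(self, window_s: float = 300.0, clock=time.monotonic):
        self._window_s = window_s
        self._clock = clock
        self._last = {}  # svc_key -> time of the last successful proposal

    def is_blocked(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is not None and self._clock() - last < self._window_s

    def record(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def handle(svc_key: str, cooldown: CooldownSketch, propose):
    """propose(svc_key) returns a proposal, or None when the whole LLM chain failed."""
    if cooldown.is_blocked(svc_key):
        return None
    proposal = propose(svc_key)
    if proposal is not None:
        cooldown.record(svc_key)  # only a produced proposal starts the cooldown
    return proposal
```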
|
||||
|
||||
### 3. Fallback chain: local-first

Order: Ollama (local GPU) → Haiku → Sonnet.

Rationale:
- Ollama sends no data to external services; low latency for simple cases
- Haiku = fast, cheap cloud fallback
- Sonnet = last resort for hard cases

A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
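
The rejection logic can be sketched without litellm by injecting each model as a plain callable. Everything here is an assumption made for illustration (function name, tuple shapes, outcome labels), and the refusal-pattern check is left out for brevity.

```python
import json


def route(prompt: str, chain, validate):
    """Try each (name, call) in order; return (model_used, parsed, attempts)."""
    attempts = []
    for name, call in chain:
        try:
            parsed = json.loads(call(prompt))
            validate(parsed)  # expected to raise ValueError on a schema error
        except TimeoutError:
            attempts.append((name, "timeout"))
        except OSError:
            attempts.append((name, "network_error"))
        except json.JSONDecodeError:
            attempts.append((name, "invalid_json"))
        except ValueError:
            attempts.append((name, "schema_error"))
        else:
            attempts.append((name, "success"))
            return name, parsed, attempts
    return None, None, attempts  # the whole chain failed
```

The order of the `except` clauses matters: `TimeoutError` is a subclass of `OSError` and `json.JSONDecodeError` is a subclass of `ValueError`, so the more specific one must come first.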
|
||||
|
||||
### 4. No imports from control-plane

`services/planner-agent/` is fully self-contained. It imports nothing from
`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
`scripts/lib/events.py`).

Rationale: the planner must keep working even if control-plane is offline, and it has a separate
deployment cycle.
|
||||
|
||||
### 5. structlog with PrintLoggerFactory

We do not use `structlog.stdlib.add_logger_name`: `PrintLogger` has no `.name` attribute.
Instead, the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer`
→ `format_exc_info` → `JSONRenderer`.
|
||||
|
||||
### 6. NODE_NAME is read at call time, not at import time

`_emit_event_sync` reads the module-level `NODE_NAME` on every call
(not as a default parameter). This makes it patchable in tests.
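
The difference is standard Python behaviour: default parameter values are evaluated once, when the function is defined. A small self-contained illustration (the function names are hypothetical):

```python
NODE_NAME = "solaria"


def emit_bound_at_definition(node=NODE_NAME):
    # The default was captured when the function was defined.
    return node


def emit_read_at_call():
    # The global is looked up on every call, so later patches are visible.
    return NODE_NAME


# Stands in for a test patching the module attribute:
NODE_NAME = "test-node"
```

After the reassignment, `emit_bound_at_definition()` still returns `"solaria"`, while `emit_read_at_call()` returns `"test-node"`.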
|
||||
|
||||
---
|
||||
|
||||
## Problems encountered and solutions

### Problem: `localhost` inside the container does not reach the host

**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container
on the default bridge network cannot reach the host via `localhost`.

**Solution:**
1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
|
||||
|
||||
### Problem: `environment` vs `env_file`: duplicated variables

**Context:** The first version of docker-compose.yml had every variable hardcoded
in the `environment` section with fallback values (`${VAR:-default}`). As a result,
`.env` was optional rather than required.

**Solution:** All runtime variables were removed from `environment` and moved to `env_file`.
Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not live in a file on disk).
|
||||
|
||||
### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger

**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`

**Solution:** Removed `add_logger_name` from the processor chain. It is not
compatible with `PrintLoggerFactory`.
|
||||
|
||||
### Problem: verify stage fails right after startup

**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat does not exist.

**Cause:** Race condition: the agent needs a few seconds to start
its loop and perform the first heartbeat `touch()`.

**Solution:** This is not a real error. The Docker healthcheck has `start_period: 30s`.
The container shows `(healthy)` 30s after start.
|
||||
|
||||
### Problem: git pull with divergent branches on solaria

**Symptom:** Solaria had 2 local commits that were not on Forgejo, plus manual changes in the working tree.
`git pull` failed with "Need to specify how to reconcile divergent branches."

**Solution:**
```bash
git checkout -- services/planner-agent/docker-compose.yml  # discard manual changes
git fetch origin
git rebase origin/master  # rebase local commits on top of master
```
|
||||
|
||||
---
|
||||
|
||||
## Deployment status on SOLARIA

```
Container:  planner-agent   Up ~30m (healthy)
Image:      planner-agent-planner-agent
Node:       solaria (100.100.231.104)
Heartbeat:  /opt/homelab/state/planner-agent.heartbeat (age 0s)

Channels subscribed:
  - health_events
  - world_updates

LLM chain:
  PRIMARY:  ollama/qwen2.5-coder:14b @ http://host-gateway:11434
  FALLBACK: claude-haiku-4-5-20251001 (disabled, no ANTHROPIC_API_KEY)
  FALLBACK: claude-sonnet-4-6 (disabled, no ANTHROPIC_API_KEY)

Redis: redis://100.108.208.3:6379 ✓ connected
```
|
||||
|
||||
---
|
||||
|
||||
## Left for later

### 1. ANTHROPIC_API_KEY: cloud fallback disabled

Haiku and Sonnet are configured in the chain but have no API key.
When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.

To enable:
```bash
ssh oskar@100.100.231.104
echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
```
|
||||
|
||||
### 2. End-to-end test with a real event

The planner is connected to Redis and listening, but no event has yet
gone through the full path (LLM call → pending action → operator UI).

Test:
```bash
redis-cli -h 100.108.208.3 PUBLISH health_events '{
  "type": "service_unhealthy",
  "node": "piha",
  "service": "mosquitto",
  "severity": "error",
  "payload": {"reason": "container exited"},
  "timestamp": "2026-05-27T20:00:00Z"
}'
# Watch: docker logs planner-agent -f
# Check: ls /opt/homelab/actions/pending/
```
|
||||
|
||||
### 3. Solaria local commits

Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`)
that are not on Forgejo. They were rebased on top of master but not pushed.
They should be pushed, or reviewed and possibly squashed.
|
||||
|
||||
### 4. Integration with operator UI / Telegram

Proposals in `actions/pending/` do not yet have a notification channel to the operator.
The Telegram bot should send a notification when a new file appears in `pending/`.
|
||||
|
||||
---
|
||||
|
||||
## Commits from this session
|
||||
|
||||
```
|
||||
ff6fda1 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
|
||||
ca37fca Add planner-agent: LLM-powered remediation planner
|
||||
(llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
|
||||
healthcheck.sh, Dockerfile)
|
||||
```
|
||||
103
docs/sessions/2026-05-27.md
Normal file
|
|
@ -0,0 +1,103 @@
|
|||
# SESSION: Stabilizing the homelab multi-agent system
|
||||
|
||||
**DATE:** 2026-05-27
|
||||
**RESULT:** System NOMINAL (97/97 services, 0 errors)
|
||||
|
||||
---
|
||||
|
||||
## PROBLEMS FOUND
|
||||
|
||||
- stability-agent did not generate remediation actions: only redeploy, no container_restart
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
- zigbee2mqtt had never been deployed on chelsty-infra
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
- observer used a single global checkpoint instead of per-node ones: it silently skipped event directories that sort before the current checkpoint
- supervisor did not cancel resolved actions, so the pending queue grew without bound
- the `service_healthy` event did not close active incidents
- NODE_ALIAS_MAP was not configured: node names in events did not match the topology
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
|
||||
|
||||
---
|
||||
|
||||
## FIXES SHIPPED (commits in master)
|
||||
|
||||
```
|
||||
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
|
||||
b40b832 Fix ghost service keys from hash-prefixed Docker container names
|
||||
28e9534 observer: service_healthy resolves active incidents
|
||||
46ae92b supervisor: also cancel pending actions for services removed from desired state
|
||||
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
|
||||
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
|
||||
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
|
||||
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
|
||||
fb7828b supervisor: auto-cancel pending actions when drift is resolved
|
||||
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
|
||||
267742c vps/node-agent: add network_mode: host for control-plane health probe
|
||||
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
|
||||
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
|
||||
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
|
||||
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
|
||||
65bac4e fix(node-agent): mount host SSH key into container for event shipping
|
||||
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
|
||||
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
|
||||
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
|
||||
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
|
||||
```
|
||||
|
||||
### Details of key fixes

**fix(observer): per-node checkpoints**
A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.

**fix(observer): ghost key pruning**
`_prune_stale_world()` now removes services.json entries whose service key matches the pattern `<12hexchars>_<name>`: artifacts of Docker internal state tracking.

**fix(node-agent): canonical container name**
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
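
The naming rule can be sketched as a pure function. This is not the node-agent code: the function name is hypothetical, and the regex assumes a lowercase 12-character hex prefix followed by an underscore.

```python
import re

_HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")


def canonical_service_name(name: str, labels: dict) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix from the container name."""
    label = labels.get("com.docker.compose.service")
    if label:
        return label
    return _HASH_PREFIX.sub("", name)
```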
|
||||
|
||||
**fix(node-agent): service_healthy emission**
Node-agent now emits `service_healthy` for every running managed container on each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.

**fix(supervisor): auto-cancel resolved actions**
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
- the service has become healthy (`drift_resolved_auto`)
- the service has been removed from desired state (`service_removed_from_desired_state`)

**fix(supervisor): monitor:false**
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. It is used for `homeassistant` on chelsty-ha (no node-agent).

**fix(agent-system/materializer): control-plane API as source**
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.

**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).

**fix(chelsty-infra/zigbee2mqtt): writable config**
z2m migrates and rewrites `configuration.yaml` on startup. The config must live in the data directory, `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
|
||||
|
||||
---
|
||||
|
||||
## FINAL STATE

| Node | Status | Services |
|------|--------|---------|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
| solaria | online | node-agent, stability-agent, AI workloads |
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once the SLZB-06U is back online), node-agent, stability-agent |
| chelsty-ha | - | homeassistant (monitor:false: no node-agent, HA is monitored indirectly via MQTT) |
|
||||
|
||||
**Action queue:** 0 pending, 0 approved, 0 running
|
||||
**Incidents:** 0 active
|
||||
**Ghost service keys:** 0
|
||||
|
||||
---
|
||||
|
||||
## KNOWN LIMITATIONS / TODO

- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
- The `ezsp` adapter in the z2m configuration is deprecated; migrating to `ember` is recommended. This needs no new configuration, only changing the field to `adapter: ember` in `configuration.yaml`.
- chelsty-ha has no node-agent. Add one when the machine is available, or bootstrap it manually.
- Redis on piha still contains old keys `homelab:nodes:*`, `homelab:incidents:*` etc.; the materializer no longer uses them in API mode, so they can be cleaned up.
|
||||
62
docs/stability-agent-rollout.md
Normal file
|
|
@ -0,0 +1,62 @@
|
|||
# Stability Agent Multi-Node Rollout

## Architecture Summary
The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.

- **Source**: `services/stability-agent`
- **State Path**: `/opt/homelab/state`
- **Events Path**: `/opt/homelab/events`
- **Redis Target**: `100.108.208.3:6379` (PIHA)

## Why UI only showed CHELSTY
Previously, the `stability-agent` defaulted `NODE_NAME` to `chelsty` and was deployed only there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Because no other agents were publishing under their own `NODE_NAME`, the UI remained limited to the single active node.
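
To make the effect concrete, here is a minimal Python sketch of how node names could be derived from keys of the form `homelab:nodes:<NODE_NAME>`. The function name `node_names_from_keys` and the sample key lists are hypothetical; the real materializer may discover nodes differently.

```python
NODE_KEY_PREFIX = "homelab:nodes:"


def node_names_from_keys(keys):
    """Return the sorted node names encoded in homelab:nodes:<NODE_NAME> keys.

    Keys with any other prefix (or an empty node name) are ignored.
    """
    names = set()
    for key in keys:
        if key.startswith(NODE_KEY_PREFIX):
            name = key[len(NODE_KEY_PREFIX):]
            if name:
                names.add(name)
    return sorted(names)


# Hypothetical key sets: before the rollout only one agent published state.
before = ["homelab:nodes:chelsty", "homelab:incidents:1"]
after = ["homelab:nodes:piha", "homelab:nodes:vps", "homelab:nodes:chelsty"]

print(node_names_from_keys(before))
print(node_names_from_keys(after))
```

Under this assumption, a node appears in the UI only once its agent publishes a key under its own `NODE_NAME`.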

## Deployment

Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.

```bash
# Print commands
./scripts/deploy/deploy-stability-agent.sh <node-name>

# Deploy via SSH (executes ssh oskar@<ip>)
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
```

### Manual Steps per Node
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
```bash
cd /home/oskar/homelab-codex-ws
git fetch origin
git checkout master
git pull origin master
cd services/stability-agent
./deploy-local.sh <node-name>
```

## Verification

### Fleet Overview
Run the verification script from any node with `redis-cli` access:
```bash
./scripts/deploy/verify-agent-fleet.sh
```

### Redis Inspection (on PIHA)
```bash
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
```

Verify Web UI backend:
```bash
curl -s http://127.0.0.1:18180/nodes
curl -k https://agents.okit.pl/nodes
```

## Troubleshooting

- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.
|
|
@ -49,9 +49,10 @@ Runtime state must live outside the repository to keep it immutable and clean.
|
|||
 ## Service Standards

 1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
-2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
-3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
-4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
+2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
+3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
+4. **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
+5. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.

 ## Docker Compose Standards
|
||||
|
|
|
|||
126
docs/vps-control-plane.md
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
# VPS Control Plane

The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.

## Architecture

The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:

| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; listens on port 18180 |

All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).

## Supervisor Behavior

### Desired State
Loaded from `hosts/*/services.yaml` on each reconcile cycle. Services with `monitor: false` are silently skipped; use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).

### Drift Types
- `missing_service`: the service is in desired state but absent from `services.json`
- `unhealthy_service`: the service exists in `services.json` but `status != healthy`
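
The two drift types can be illustrated with a small Python sketch. The dictionary shapes (keys of `(node, service)`, a `monitor` flag, a `status` field) and the function name `detect_drift` are assumptions made for this example; the supervisor's actual data model is not shown in this document.

```python
def detect_drift(desired, actual):
    """Compare desired and actual state and return a list of drift records.

    desired: {(node, service): {"monitor": bool}}   (hypothetical shape)
    actual:  {(node, service): {"status": str}}     (hypothetical shape)
    """
    drift = []
    for key, spec in sorted(desired.items()):
        # Services marked monitor: false are silently skipped.
        if spec.get("monitor", True) is False:
            continue
        state = actual.get(key)
        if state is None:
            drift.append((key, "missing_service"))
        elif state.get("status") != "healthy":
            drift.append((key, "unhealthy_service"))
    return drift


desired = {
    ("piha", "node-agent"): {},
    ("chelsty-infra", "zigbee2mqtt"): {},
    ("chelsty-ha", "homeassistant"): {"monitor": False},
}
actual = {("chelsty-infra", "zigbee2mqtt"): {"status": "unhealthy"}}

for key, kind in detect_drift(desired, actual):
    print(key, kind)
```

Skipping `monitor: false` entries before the lookup is what keeps unmonitored services from ever being reported as missing.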

### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
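
Read as code, the table amounts to a small lookup. The following Python sketch mirrors it; the function names `plan_service_action` and `plan_node_action` and the returned dictionary shape are hypothetical.

```python
# Triggers that the table maps to a low-risk container restart.
LOW_RISK_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def plan_service_action(trigger):
    """Map a service-level trigger to an action type and risk level."""
    if trigger in LOW_RISK_TRIGGERS:
        return {"type": "container_restart", "risk": "low"}
    # Any other or unknown trigger falls back to a guarded redeploy.
    return {"type": "redeploy", "risk": "guarded"}


def plan_node_action(disk_pressure):
    """Map node-level disk pressure to an action, or None if nothing is needed."""
    if disk_pressure == "high":
        return {"type": "disk_cleanup", "risk": "guarded"}
    return None


print(plan_service_action("mqtt_unreachable"))
print(plan_service_action("something_else"))
print(plan_node_action("high"))
```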

### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, which makes reconciliation idempotent across supervisor restarts.
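
A minimal Python sketch of such an ID scheme follows. The helper names `action_id` and `action_filename` are hypothetical, and deriving the prefix by replacing underscores with hyphens is an assumption that happens to reproduce the two documented patterns.

```python
def action_id(action_type, node, service):
    """Build a deterministic action ID such as redeploy-piha-node-agent."""
    # container_restart -> container-restart, redeploy -> redeploy
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"


def action_filename(action_type, node, service):
    """File name used in the action queue directories (<id>.json)."""
    return action_id(action_type, node, service) + ".json"


# The same drift yields the same name on every reconcile cycle, so writing the
# file again overwrites the existing entry instead of queueing a duplicate.
first = action_filename("container_restart", "chelsty-infra", "zigbee2mqtt")
second = action_filename("container_restart", "chelsty-infra", "zigbee2mqtt")
print(first, first == second)
```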

### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`**: the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`**: the service was removed from `services.yaml` or marked `monitor: false`

Only `pending` actions are auto-cancelled. Approved and running actions have been committed to by the operator and are never cancelled automatically.
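
The auto-cancel rules can be sketched as a single decision function. The action and state dictionary shapes and the name `auto_cancel_reason` are assumptions for this example, as is the order in which the two reasons are checked.

```python
AUTO_CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def auto_cancel_reason(action, desired, actual):
    """Return the cancel reason for an action, or None if it must be kept.

    action:  {"state", "type", "node", "service"}   (hypothetical shape)
    desired: {(node, service): {"monitor": bool}}
    actual:  {(node, service): {"status": str}}
    """
    # Approved or running actions are never cancelled automatically.
    if action["state"] != "pending":
        return None
    if action["type"] not in AUTO_CANCELLABLE_TYPES:
        return None
    key = (action["node"], action["service"])
    spec = desired.get(key)
    if spec is None or spec.get("monitor", True) is False:
        return "service_removed_from_desired_state"
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None


action = {"state": "pending", "type": "redeploy", "node": "piha", "service": "node-agent"}
desired = {("piha", "node-agent"): {}}
print(auto_cancel_reason(action, desired, {("piha", "node-agent"): {"status": "healthy"}}))
```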

### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:

```bash
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```
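
For illustration, a JSON-valued variable like this could be parsed and applied as in the Python sketch below. The helper names `load_alias_map` and `canonical_node` are hypothetical, and treating an unset variable as an empty map is an assumption rather than documented supervisor behavior.

```python
import json
import os


def load_alias_map(environ=None):
    """Parse NODE_ALIAS_MAP (a JSON object) into a dict; empty if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(alias): str(name) for alias, name in parsed.items()}


def canonical_node(name, alias_map):
    """Translate an event-source node name to its canonical topology name."""
    return alias_map.get(name, name)


env = {"NODE_ALIAS_MAP": '{"node-2": "chelsty-infra", "node-1": "piha"}'}
aliases = load_alias_map(env)
print(canonical_node("node-2", aliases))
print(canonical_node("solaria", aliases))
```

Names without an alias pass through unchanged, so nodes that already report their canonical name need no entry.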

## Deployment

### From SATURN (primary control node)
```bash
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh

# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```

### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```

`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.

### Verification
```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```

## Action Approval Workflow

```
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```

Possible action states: `pending → approved → running → completed / failed / rejected`
Auto-cancel path: `pending → cancelled/`
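
Because the queue is a set of directories, approving an action can be pictured as moving a JSON file from `pending/` to `approved/`. The Python sketch below shows that idea together with a transition table. The names `ALLOWED_TRANSITIONS`, `can_transition`, and `approve_action` are hypothetical, and placing `rejected` as a transition out of `pending` is an assumption; the document lists the state but not where it is entered from.

```python
from pathlib import Path

# Transition table derived from the states listed above. Where "rejected"
# is entered from is an assumption (operator rejects a pending action).
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected", "cancelled"},
    "approved": {"running"},
    "running": {"completed", "failed"},
}


def can_transition(current, target):
    """Return True if moving an action from current to target is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def approve_action(actions_root, action_id):
    """Move <id>.json from pending/ to approved/ and return the new path."""
    root = Path(actions_root)
    src = root / "pending" / f"{action_id}.json"
    dst_dir = root / "approved"
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    # Path.replace renames the file; it raises FileNotFoundError if the
    # pending action no longer exists (for example, it was auto-cancelled).
    src.replace(dst)
    return dst
```

A rename within one filesystem means a reader sees the file in exactly one state directory at a time, which fits the filesystem-first design described earlier.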

## Recovery

### World state is stale or corrupt
```bash
# On VPS: delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```

### Flood of pending actions after bootstrap
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.

```bash
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```

### Rebuild from scratch
```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```

## Integration

### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.

Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml`, which sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.

### Nginx Proxy Manager
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.

### Log Locations
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
24
hosts/chelsty-ha/capabilities.yaml
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
host: chelsty-ha
|
||||
site: chelsty
|
||||
|
||||
capabilities:
|
||||
networking:
|
||||
reachability: tailscale-only
|
||||
tailscale_ip: 100.122.201.23
|
||||
ingress_suitability: false
|
||||
bandwidth: LTE
|
||||
|
||||
runtime:
|
||||
container_engine: docker
|
||||
os: debian
|
||||
|
||||
operational:
|
||||
connectivity: intermittent
|
||||
availability_target: best-effort
|
||||
offline_first: true
|
||||
uplink: lte
|
||||
|
||||
deployment:
|
||||
suitability:
|
||||
- homeassistant
|
||||
restricted: false
|
||||
20
hosts/chelsty-ha/host.yaml
Normal file
|
|
@ -0,0 +1,20 @@
|
|||
hostname: chelsty-ha
|
||||
site: chelsty
|
||||
|
||||
roles:
|
||||
- homeassistant
|
||||
|
||||
network:
|
||||
tailscale_ip: 100.122.201.23
|
||||
|
||||
runtime:
|
||||
root: /opt/homelab
|
||||
|
||||
deployment:
|
||||
mode: pull
|
||||
managed_by: saturn
|
||||
|
||||
constraints:
|
||||
connectivity:
|
||||
intermittent: true
|
||||
uplink: lte
|
||||
12
hosts/chelsty-ha/services.yaml
Normal file
|
|
@ -0,0 +1,12 @@
|
|||
host: chelsty-ha
site: chelsty

services:
  homeassistant:
    role: home-automation-controller
    offline_required: true
    # monitor: false because chelsty-ha has no node-agent deployed, so there are no
    # container-health events for the observer to track. HA is monitored
    # indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
    # is likely down). Re-enable once node-agent is bootstrapped on this VM.
    monitor: false
|
||||
|
|
@ -1,3 +1,6 @@
|
|||
host: chelsty-infra
|
||||
site: chelsty
|
||||
|
||||
capabilities:
|
||||
hardware:
|
||||
cpu:
|
||||
|
|
@ -8,33 +11,34 @@ capabilities:
|
|||
total_gb: 16
|
||||
acceleration:
|
||||
type: none
|
||||
|
||||
|
||||
virtualization:
|
||||
supported: true
|
||||
type: kvm
|
||||
|
||||
|
||||
storage:
|
||||
persistence: persistent
|
||||
type: ssd
|
||||
capacity_gb: 250
|
||||
|
||||
|
||||
networking:
|
||||
reachability: tailscale-only
|
||||
ingress_suitability: false
|
||||
bandwidth: LTE
|
||||
|
||||
|
||||
runtime:
|
||||
container_engine: docker
|
||||
os: debian
|
||||
|
||||
|
||||
operational:
|
||||
power_constraint: low-power
|
||||
connectivity: intermittent
|
||||
availability_target: best-effort
|
||||
|
||||
offline_operation_required: true
|
||||
|
||||
deployment:
|
||||
suitability:
|
||||
- staging
|
||||
- homeassistant
|
||||
- infra
|
||||
- edge
|
||||
restricted: false
|
||||
|
|
@ -1,9 +1,10 @@
|
|||
hostname: chelsty
|
||||
hostname: chelsty-infra
|
||||
site: chelsty
|
||||
|
||||
roles:
|
||||
- edge
|
||||
- hypervisor
|
||||
- homeassistant
|
||||
- infra
|
||||
- staging
|
||||
|
||||
network:
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
host: chelsty
|
||||
host: chelsty-infra
|
||||
|
||||
uplink:
|
||||
type: lte
|
||||
|
|
@ -20,7 +20,7 @@ exposure_classes:
|
|||
|
||||
networks:
|
||||
home_automation_lan:
|
||||
purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
|
||||
purpose: MQTT broker, Zigbee coordinator, and local device control.
|
||||
offline_required: true
|
||||
internet_required_for_core_operation: false
|
||||
|
||||
|
|
@ -1,4 +1,4 @@
|
|||
host: chelsty
|
||||
host: chelsty-infra
|
||||
|
||||
runtime_root: /opt/homelab
|
||||
|
||||
|
|
@ -9,12 +9,6 @@ conventions:
|
|||
logs: /opt/homelab/logs
|
||||
|
||||
services:
|
||||
homeassistant:
|
||||
data: /opt/homelab/data/homeassistant
|
||||
config: /opt/homelab/config/homeassistant
|
||||
logs: /opt/homelab/logs/homeassistant
|
||||
backup_priority: critical
|
||||
|
||||
zigbee2mqtt:
|
||||
data: /opt/homelab/data/zigbee2mqtt
|
||||
config: /opt/homelab/config/zigbee2mqtt
|
||||
|
|
@ -27,13 +21,13 @@ services:
|
|||
logs: /opt/homelab/logs/mosquitto
|
||||
backup_priority: high
|
||||
|
||||
backup_sets:
|
||||
homeassistant:
|
||||
include:
|
||||
- /opt/homelab/config/homeassistant
|
||||
- /opt/homelab/data/homeassistant
|
||||
restore_note: Restore before starting the Home Assistant container.
|
||||
stability-agent:
|
||||
data: /opt/homelab/state
|
||||
config: /opt/homelab/config/stability-agent
|
||||
logs: /opt/homelab/events
|
||||
backup_priority: low
|
||||
|
||||
backup_sets:
|
||||
zigbee2mqtt:
|
||||
include:
|
||||
- /opt/homelab/config/zigbee2mqtt
|
||||
88
hosts/chelsty-infra/runtime/frigate/config.yml
Normal file
|
|
@ -0,0 +1,88 @@
|
|||
# Frigate NVR — chelsty-infra
|
||||
# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
|
||||
# Object detection: CPU (no Coral TPU)
|
||||
# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
|
||||
#
|
||||
# Required env vars in /opt/homelab/config/frigate/frigate.env:
|
||||
# CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
|
||||
# CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
|
||||
# MQTT_USER, MQTT_PASS (if mosquitto auth is enabled)
|
||||
|
||||
mqtt:
|
||||
enabled: true
|
||||
host: 127.0.0.1
|
||||
port: 1883
|
||||
# user: "{MQTT_USER}"
|
||||
# password: "{MQTT_PASS}"
|
||||
|
||||
detectors:
|
||||
cpu1:
|
||||
type: cpu
|
||||
num_threads: 3
|
||||
|
||||
ffmpeg:
|
||||
hwaccel_args: preset-vaapi
|
||||
global_args:
|
||||
- -hide_banner
|
||||
- -loglevel
|
||||
- warning
|
||||
|
||||
record:
|
||||
enabled: true
|
||||
retain:
|
||||
days: 7
|
||||
mode: all
|
||||
events:
|
||||
retain:
|
||||
default: 14
|
||||
mode: motion
|
||||
|
||||
snapshots:
|
||||
enabled: true
|
||||
retain:
|
||||
default: 7
|
||||
quality: 70
|
||||
|
||||
objects:
|
||||
track:
|
||||
- person
|
||||
- car
|
||||
- bicycle
|
||||
filters:
|
||||
person:
|
||||
min_area: 5000
|
||||
max_area: 100000
|
||||
threshold: 0.7
|
||||
|
||||
cameras:
|
||||
camera1:
|
||||
ffmpeg:
|
||||
inputs:
|
||||
# Main stream — high-res recording
|
||||
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
|
||||
roles:
|
||||
- record
|
||||
# Sub stream — low-res detection (lower CPU cost)
|
||||
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
|
||||
roles:
|
||||
- detect
|
||||
detect:
|
||||
enabled: true
|
||||
width: 640
|
||||
height: 480
|
||||
fps: 5
|
||||
|
||||
camera2:
|
||||
ffmpeg:
|
||||
inputs:
|
||||
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
|
||||
roles:
|
||||
- record
|
||||
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
|
||||
roles:
|
||||
- detect
|
||||
detect:
|
||||
enabled: true
|
||||
width: 640
|
||||
height: 480
|
||||
fps: 5
|
||||
25
hosts/chelsty-infra/runtime/frigate/docker-compose.yml
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
services:
|
||||
frigate:
|
||||
container_name: frigate
|
||||
image: ghcr.io/blakeblackshear/frigate:stable
|
||||
restart: unless-stopped
|
||||
privileged: true
|
||||
shm_size: "256mb"
|
||||
network_mode: host
|
||||
devices:
|
||||
- /dev/dri/renderD128:/dev/dri/renderD128
|
||||
volumes:
|
||||
- /etc/localtime:/etc/localtime:ro
|
||||
- /opt/homelab/config/frigate/config.yml:/config/config.yml
|
||||
- /opt/homelab/config/frigate:/config/credentials:ro
|
||||
- /opt/homelab/data/frigate:/media/frigate
|
||||
tmpfs:
|
||||
- /tmp/cache
|
||||
env_file:
|
||||
- /opt/homelab/config/frigate/frigate.env
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 60s
|
||||
|
|
@ -0,0 +1,11 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=chelsty-infra
|
||||
- NODE_TYPE=lte_node
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
- /home/oskar/.ssh:/root/.ssh:ro
|
||||
|
|
@ -0,0 +1,12 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=chelsty-infra
|
||||
- SITE_NAME=chelsty
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
- STABILITY_CHECK_INTERVAL=60
|
||||
- DISK_THRESHOLD_PCT=85
|
||||
- MQTT_HOST=mosquitto
|
||||
- MQTT_PORT=1883
|
||||
|
|
@ -0,0 +1,21 @@
|
|||
services:
  zigbee2mqtt:
    # mosquitto runs with network_mode: host on chelsty-infra.
    # extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
    # mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
    # mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
    extra_hosts:
      - "mosquitto:host-gateway"
    environment:
      - TZ=Europe/Warsaw
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 90s
    # Note: volumes NOT overridden here.
    # The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
    # (read-write). configuration.yaml must be placed in that directory on the node:
    #   /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
    # z2m rewrites this file during migrations, so a read-only mount is not viable.
|
||||
37
hosts/chelsty-infra/services.yaml
Normal file
|
|
@ -0,0 +1,37 @@
|
|||
host: chelsty-infra
|
||||
site: chelsty
|
||||
|
||||
services:
|
||||
ha-diag-agent:
|
||||
role: ha-diagnostic-agent
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local: []
|
||||
external: [homeassistant]
|
||||
config:
|
||||
target_url: http://100.70.180.90:8123 # chelsty-ha via Tailscale (HAOS, separate VM)
|
||||
location_tag: "chelsty"
|
||||
events_dir: /opt/homelab/events/chelsty-infra
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/ha-diag-agent
|
||||
data_path: /var/lib/ha-diag-agent
|
||||
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
# LTE node: node-agent monitors and emits events but does NO Docker cleanup.
|
||||
# Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
|
||||
# own retain policy is the correct remediation, not docker prune.
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
|
||||
mosquitto:
|
||||
role: local-mqtt-broker
|
||||
|
||||
zigbee2mqtt:
|
||||
role: zigbee-mqtt-bridge
|
||||
|
||||
frigate:
|
||||
role: nvr
|
||||
|
|
@ -1,13 +0,0 @@
|
|||
services:
|
||||
zigbee2mqtt:
|
||||
volumes:
|
||||
- ./configuration.yaml:/app/data/configuration.yaml:ro
|
||||
environment:
|
||||
- MQTT_USER=${MQTT_USER}
|
||||
- MQTT_PASSWORD=${MQTT_PASSWORD}
|
||||
# Healthcheck is already defined in base service, but we ensure compatibility
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8080"]
|
||||
interval: 10s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
|
|
@ -1,108 +0,0 @@
|
|||
host: chelsty
|
||||
|
||||
exposure_classes:
|
||||
local-only:
|
||||
description: Reachable only from CHELSTY-local networks or container networks.
|
||||
public_ingress: false
|
||||
tailscale_required: false
|
||||
tailscale-internal:
|
||||
description: Reachable through the Tailscale mesh by approved tailnet clients.
|
||||
public_ingress: false
|
||||
tailscale_required: true
|
||||
public:
|
||||
description: Reachable from the public internet through an explicit ingress path.
|
||||
public_ingress: true
|
||||
tailscale_required: false
|
||||
|
||||
operational_constraints:
|
||||
uplink: lte
|
||||
connectivity: intermittent
|
||||
offline_operation_required: true
|
||||
must_not_depend_on:
|
||||
- saturn
|
||||
- vps
|
||||
- forgejo
|
||||
|
||||
services:
|
||||
homeassistant:
|
||||
role: home-automation-controller
|
||||
deployment_model: docker-compose
|
||||
exposure: tailscale-internal
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local:
|
||||
- mosquitto
|
||||
- zigbee2mqtt
|
||||
external: []
|
||||
ports:
|
||||
- name: http
|
||||
container_port: 8123
|
||||
protocol: tcp
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/homeassistant
|
||||
data_path: /opt/homelab/data/homeassistant
|
||||
logs_path: /opt/homelab/logs/homeassistant
|
||||
backup:
|
||||
recommended: true
|
||||
include:
|
||||
- /opt/homelab/config/homeassistant
|
||||
- /opt/homelab/data/homeassistant
|
||||
notes:
|
||||
- Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
|
||||
- Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
|
||||
|
||||
zigbee2mqtt:
|
||||
role: zigbee-mqtt-bridge
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local:
|
||||
- mosquitto
|
||||
external:
|
||||
- slzb-06u
|
||||
coordinator:
|
||||
name: slzb-06u
|
||||
connection: network
|
||||
usb_device: null
|
||||
ports:
|
||||
- name: frontend
|
||||
container_port: 8080
|
||||
protocol: tcp
|
||||
exposure: tailscale-internal
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/zigbee2mqtt
|
||||
data_path: /opt/homelab/data/zigbee2mqtt
|
||||
logs_path: /opt/homelab/logs/zigbee2mqtt
|
||||
backup:
|
||||
recommended: true
|
||||
include:
|
||||
- /opt/homelab/config/zigbee2mqtt
|
||||
- /opt/homelab/data/zigbee2mqtt
|
||||
notes:
|
||||
- Include configuration.yaml, database.db, coordinator backup files, and network key material.
|
||||
- Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
|
||||
|
||||
mosquitto:
|
||||
role: local-mqtt-broker
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
ports:
|
||||
- name: mqtt
|
||||
container_port: 1883
|
||||
protocol: tcp
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/mosquitto
|
||||
data_path: /opt/homelab/data/mosquitto
|
||||
logs_path: /opt/homelab/logs/mosquitto
|
||||
backup:
|
||||
recommended: true
|
||||
include:
|
||||
- /opt/homelab/config/mosquitto
|
||||
- /opt/homelab/data/mosquitto
|
||||
notes:
|
||||
- Retain ACL, password, persistence, and bridge configuration if enabled.
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
services:
|
||||
runtime-materializer:
|
||||
environment:
|
||||
# Pull world state from the VPS control-plane API instead of local Redis.
|
||||
# The observer on VPS is the authoritative writer; mirroring its API output
|
||||
# here ensures the webui /snapshot matches the clean 97-service state that
|
||||
# the control-plane /summary endpoint serves.
|
||||
CONTROL_PLANE_URL: "http://100.95.58.48:18180"
|
||||
|
|
@ -0,0 +1,4 @@
|
|||
services:
|
||||
brain-watchdog:
|
||||
mem_limit: 64m
|
||||
restart: unless-stopped
|
||||
11
hosts/piha/runtime/node-agent/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,11 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=piha
|
||||
- NODE_TYPE=sd_card
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
- /home/oskar/.ssh:/root/.ssh:ro
|
||||
|
|
@ -0,0 +1,7 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=piha
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
42
hosts/piha/services.yaml
Normal file
|
|
@ -0,0 +1,42 @@
|
|||
host: piha
|
||||
|
||||
services:
|
||||
ha-diag-agent:
|
||||
role: ha-diagnostic-agent
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local: []
|
||||
external: [homeassistant]
|
||||
config:
|
||||
target_url: http://localhost:8123
|
||||
location_tag: "ken"
|
||||
events_dir: /opt/homelab/events/piha
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/ha-diag-agent
|
||||
data_path: /var/lib/ha-diag-agent
|
||||
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/node-agent
|
||||
data_path: /opt/homelab/state
|
||||
logs_path: /opt/homelab/events
|
||||
|
||||
brain-watchdog:
|
||||
role: control-plane-watchdog
|
||||
deployment_model: docker-compose
|
||||
exposure: private
|
||||
offline_required: false
|
||||
depends_on:
|
||||
local: []
|
||||
external: [control-plane]
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/brain-watchdog
|
||||
11
hosts/solaria/runtime/node-agent/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,11 @@
|
|||
services:
|
||||
node-agent:
|
||||
environment:
|
||||
- NODE_NAME=solaria
|
||||
- NODE_TYPE=ai_node
|
||||
- VPS_EVENTS_HOST=100.95.58.48
|
||||
- VPS_EVENTS_USER=oskar
|
||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||
- CHECK_INTERVAL=60
|
||||
volumes:
|
||||
- /home/oskar/.ssh:/root/.ssh:ro
|
||||
|
|
@ -0,0 +1,7 @@
|
|||
services:
|
||||
stability-agent:
|
||||
environment:
|
||||
- NODE_NAME=solaria
|
||||
- REDIS_HOST=100.108.208.3
|
||||
- REDIS_PORT=6379
|
||||
- REDIS_ENABLED=true
|
||||
15
hosts/solaria/services.yaml
Normal file
|
|
@ -0,0 +1,15 @@
|
|||
host: solaria
|
||||
|
||||
services:
|
||||
node-agent:
|
||||
role: node-stability-monitor
|
||||
deployment_model: docker-compose
|
||||
exposure: local-only
|
||||
offline_required: true
|
||||
depends_on:
|
||||
local: []
|
||||
external: []
|
||||
runtime:
|
||||
config_path: /opt/homelab/config/node-agent
|
||||
data_path: /opt/homelab/state
|
||||
logs_path: /opt/homelab/events
|
||||
39
hosts/vps/runtime/control-plane/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,39 @@
|
|||
# Control-plane production overrides for the VPS deployment.
|
||||
#
|
||||
# NODE_ALIAS_MAP translates the node names that appear in raw event files
|
||||
# (written by node agents / seed scripts) to the canonical names used in
|
||||
# inventory/topology.yaml and hosts/*/services.yaml.
|
||||
#
|
||||
# Current live mapping (from /opt/homelab/events/ inspection):
|
||||
# node-2 → chelsty (zigbee2mqtt / mosquitto / homeassistant node)
|
||||
#
|
||||
# Add further entries when new nodes come online and their event-source names
|
||||
# differ from their topology names. Format is a single-line JSON object, e.g.:
|
||||
# NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
|
||||
#
|
||||
# The executor inherits the canonical name from the action JSON written by the
|
||||
# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
|
||||
#
|
||||
# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 ensures the
|
||||
# host kernel OOM-killer never targets control-plane containers. mem_limit
|
||||
# provides a per-container cgroup ceiling so a leaking process is restarted by
|
||||
# Docker before it can exhaust host memory.
|
||||
|
||||
services:
|
||||
operator-ui:
|
||||
mem_limit: 192m
|
||||
oom_score_adj: -900
|
||||
|
||||
observer:
|
||||
mem_limit: 192m
|
||||
oom_score_adj: -900
|
||||
|
||||
supervisor:
|
||||
mem_limit: 400m
|
||||
oom_score_adj: -900
|
||||
environment:
|
||||
- NODE_ALIAS_MAP={"node-2":"chelsty"}
|
||||
|
||||
executor:
|
||||
mem_limit: 64m
|
||||
oom_score_adj: -900
|
||||
7
hosts/vps/runtime/control-plane/env.example
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
# Control Plane Environment Variables
|
||||
PORT=8080
|
||||
HOMELAB_STATE_ROOT=/opt/homelab/state
|
||||
HOMELAB_EVENTS_ROOT=/opt/homelab/events
|
||||
HOMELAB_WORLD_ROOT=/opt/homelab/world
|
||||
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
|
||||
HOMELAB_CONFIG_ROOT=/opt/homelab/config
|
||||
16
hosts/vps/runtime/node-agent/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,16 @@
|
|||
services:
  node-agent:
    environment:
      - NODE_NAME=vps
      - CHECK_INTERVAL=60
    # host network mode: node-agent on VPS shares the host's network namespace
    # so that localhost:18180 resolves to the control-plane's exposed port.
    # Without this, localhost inside the container is the container's own loopback
    # and the _check_control_plane_health() probe would always fail.
    network_mode: host
    # HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
    # and may accumulate Python RSS over hours; 640m cap ensures it is killed and
    # auto-restarted by Docker before consuming host memory. oom_score_adj -900
    # prevents the host kernel OOM-killer from picking it as a global victim.
    mem_limit: 640m
    oom_score_adj: -900
|
||||
@@ -0,0 +1,9 @@
services:
  stability-agent:
    environment:
      - NODE_NAME=vps
      - REDIS_HOST=100.108.208.3
      - REDIS_PORT=6379
      - REDIS_ENABLED=true
    mem_limit: 96m
    oom_score_adj: -900
|
|
@ -1 +0,0 @@
|
|||
npm
|
||||
43
hosts/vps/services.yaml
Normal file
@@ -0,0 +1,43 @@
host: vps

services:
  node-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    runtime:
      config_path: /opt/homelab/config/node-agent
      data_path: /opt/homelab/state
      logs_path: /opt/homelab/events

  control-plane:
    role: management-and-orchestration
    deployment_model: docker-compose
    exposure: tailscale-internal
    offline_required: false
    depends_on:
      local:
        - node-agent
      external:
        - piha:redis
    ports:
      - name: http
        container_port: 18180
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/control-plane
      data_path: /opt/homelab/data/control-plane
      logs_path: /opt/homelab/logs/control-plane

  node_exporter:
    role: metrics-exporter
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
|
|
@ -17,6 +17,10 @@ nodes:
|
|||
roles:
|
||||
- infra
|
||||
- monitoring
|
||||
services:
|
||||
- node-agent
|
||||
- ha-diag-agent
|
||||
- brain-watchdog
|
||||
|
||||
solaria:
|
||||
roles:
|
||||
|
|
@ -27,12 +31,25 @@ nodes:
|
|||
roles:
|
||||
- edge
|
||||
- ingress
|
||||
- control-plane
|
||||
services:
|
||||
# Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
|
||||
- node-agent
|
||||
- control-plane # executor, observer, supervisor, operator-ui
|
||||
- node_exporter
|
||||
- stability-agent
|
||||
- npm # Nginx Proxy Manager — public ingress, TLS termination
|
||||
- outline # Team wiki (outline + postgres + redis)
|
||||
- joplin # Note sync server (joplin-server + postgres)
|
||||
- ai-cluster # AI workers: codex-worker, openclaw, planner-worker,
|
||||
# service-ops-worker, redis, mosquitto
|
||||
|
||||
chelsty:
|
||||
chelsty-infra:
|
||||
site: chelsty
|
||||
roles:
|
||||
- remote
|
||||
- hypervisor
|
||||
- homeassistant
|
||||
- infra
|
||||
- staging
|
||||
connectivity:
|
||||
uplink: lte
|
||||
|
|
@ -40,10 +57,22 @@ nodes:
|
|||
home_automation:
|
||||
offline_operation_required: true
|
||||
services:
|
||||
- homeassistant
|
||||
- zigbee2mqtt
|
||||
- mosquitto
|
||||
coordinator:
|
||||
model: SLZB-06U
|
||||
connection: network
|
||||
usb: false
|
||||
|
||||
chelsty-ha:
|
||||
site: chelsty
|
||||
roles:
|
||||
- remote
|
||||
- homeassistant
|
||||
connectivity:
|
||||
uplink: lte
|
||||
intermittent: true
|
||||
home_automation:
|
||||
offline_operation_required: true
|
||||
services:
|
||||
- homeassistant
|
||||
|
|
|
|||
75
scripts/bootstrap/vps-control-plane.sh
Executable file
@@ -0,0 +1,75 @@
#!/usr/bin/env bash
# vps-control-plane.sh - Bootstrap script for VPS control plane

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
RUNTIME_DIR="/opt/homelab"
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

log() { echo -e "${GREEN}[INFO]${NC} $1"; }
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }

log "Starting VPS control plane bootstrap..."

# 1. Validate Docker availability
if ! command -v docker &> /dev/null; then
    error "Docker is not installed. Please install Docker first."
fi

# 2. Validate compose plugin
if ! docker compose version &> /dev/null; then
    error "Docker Compose plugin is not installed."
fi

log "Docker and Compose plugin verified."

# 3. Create filesystem-first runtime structure
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
sudo mkdir -p "$RUNTIME_DIR/events" \
    "$RUNTIME_DIR/state" \
    "$RUNTIME_DIR/world" \
    "$RUNTIME_DIR/actions/pending" \
    "$RUNTIME_DIR/actions/approved" \
    "$RUNTIME_DIR/actions/running" \
    "$RUNTIME_DIR/actions/completed" \
    "$RUNTIME_DIR/actions/failed" \
    "$RUNTIME_DIR/actions/rejected" \
    "$RUNTIME_DIR/config" \
    "$RUNTIME_DIR/logs"

# 4. Set permissions
log "Setting permissions..."
# Quoted so an unusual user name cannot be word-split by the shell
sudo chown -R "$USER:$USER" "$RUNTIME_DIR"
chmod -R 755 "$RUNTIME_DIR"

# 5. Install environment file
log "Installing environment configuration..."
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
    cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
    log "Created $RUNTIME_DIR/config/control-plane.env from template."
else
    warn "Environment file already exists, skipping installation."
fi

# 6. Build and start the control plane
log "Building and starting control plane services..."
cd "$REPO_ROOT/services/control-plane"
docker compose build
docker compose up -d

log "VPS control plane bootstrap complete!"

echo -e "\n${YELLOW}Verification commands:${NC}"
echo "1. Check container status: docker compose ps"
echo "2. Check operator UI: curl http://localhost:8080/summary"
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
23
scripts/deploy/deploy-control-plane.sh
Executable file
@@ -0,0 +1,23 @@
#!/bin/bash
# scripts/deploy/deploy-control-plane.sh
set -e

VPS_IP="100.95.58.48"
USER="oskar"
REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"

MODE=$1

case "$MODE" in
    "--ssh")
        echo "Deploying to VPS ($VPS_IP) via SSH..."
        ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
        ;;
    "--print")
        echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
        ;;
    *)
        echo "Usage: $0 [--ssh|--print]"
        exit 1
        ;;
esac
26
scripts/deploy/deploy-frigate.sh
Executable file
@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)

MODE="print"
[[ "$1" == "--ssh" ]] && MODE="ssh"

TARGET="100.122.201.22"
NODE="chelsty-infra"
REPO_PATH="/home/oskar/homelab-codex-ws"
SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"

echo "HOST: $NODE"
echo "MODE: $MODE"
echo "TARGET: $TARGET"

# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
# before first deploy. See config.yml for required variables.
DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker-compose pull && docker-compose up -d"

if [[ "$MODE" == "ssh" ]]; then
    echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
    ssh oskar@$TARGET "$DEPLOY_CMD"
else
    echo "# --- Deployment commands for $NODE ---"
    echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
fi
|
|
@ -8,6 +8,7 @@ set -e
|
|||
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||
RUNTIME_PATH="/opt/homelab"
|
||||
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
|
||||
HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"
|
||||
|
||||
echo "--- Starting Deployment on ${HOSTNAME} ---"
|
||||
|
||||
|
|
@ -22,37 +23,47 @@ echo "Pulling latest changes..."
|
|||
git pull
|
||||
|
||||
# 2. Identify Services
|
||||
# Based on our convention, we look for services assigned to this host
|
||||
# For now, we'll check if a 'services.txt' exists in the host folder
|
||||
SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
|
||||
SERVICES=()
|
||||
if [ -f "${HOST_DIR}/services.txt" ]; then
|
||||
mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
|
||||
elif [ -f "${HOST_DIR}/services.yaml" ]; then
|
||||
SERVICES=($(python3 -c "
|
||||
import yaml, sys
|
||||
try:
|
||||
with open('${HOST_DIR}/services.yaml', 'r') as f:
|
||||
data = yaml.safe_load(f)
|
||||
if data and 'services' in data:
|
||||
if isinstance(data['services'], dict):
|
||||
print(' '.join(data['services'].keys()))
|
||||
elif isinstance(data['services'], list):
|
||||
print(' '.join(data['services']))
|
||||
except Exception as e:
|
||||
print(f'Error parsing YAML: {e}', file=sys.stderr)
|
||||
sys.exit(1)
|
||||
"))
|
||||
fi
|
||||
|
||||
if [ ! -f "$SERVICE_LIST" ]; then
|
||||
echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
|
||||
if [ ${#SERVICES[@]} -eq 0 ]; then
|
||||
echo "No services found for ${HOSTNAME}. Skipping service deployment."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# 3. Deploy Services
|
||||
while IFS= read -r service || [ -n "$service" ]; do
|
||||
[[ "$service" =~ ^#.*$ ]] && continue # Skip comments
|
||||
[[ -z "$service" ]] && continue # Skip empty lines
|
||||
|
||||
for service in "${SERVICES[@]}"; do
|
||||
echo "Deploying service: ${service}..."
|
||||
|
||||
|
||||
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
|
||||
|
||||
|
||||
if [ ! -f "$COMPOSE_FILE" ]; then
|
||||
echo "Warning: Compose file not found for ${service} at ${COMPOSE_FILE}"
|
||||
continue
|
||||
fi
|
||||
|
||||
# Target directory in runtime
|
||||
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
|
||||
mkdir -p "$TARGET_DIR"
|
||||
|
||||
# We use the compose file from the repo directly
|
||||
# but we can also handle overrides here
|
||||
OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
|
||||
|
||||
OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
|
||||
|
||||
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
|
||||
if [ -f "$OVERRIDE_FILE" ]; then
|
||||
echo "Using override file for ${service}"
|
||||
|
|
@ -60,7 +71,6 @@ while IFS= read -r service || [ -n "$service" ]; do
|
|||
fi
|
||||
|
||||
$COMPOSE_CMD up -d --remove-orphans
|
||||
|
||||
done < "$SERVICE_LIST"
|
||||
done
|
||||
|
||||
echo "--- Deployment Complete ---"
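Note on the service discovery above: the services.txt branch drops comment lines and blank lines before filling SERVICES. The following is a minimal sketch of that filter in isolation, assuming a hypothetical helper name `read_service_list` and a throwaway sample file, neither of which is in the repository. It uses the POSIX class `[[:space:]]` where the script uses the GNU-specific `\s`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print service names from a
# services.txt-style file, skipping comment lines and blank lines.
read_service_list() {
    local file="$1"
    # "|| true" keeps a file with no service lines from tripping `set -e`.
    grep -v '^[[:space:]]*#' "$file" | grep -v '^[[:space:]]*$' || true
}

# Throwaway sample input standing in for hosts/<host>/services.txt.
sample_file=$(mktemp)
printf '%s\n' '# core services' 'node-agent' '' '  # indented comment' 'control-plane' > "$sample_file"

read_service_list "$sample_file"
rm -f "$sample_file"
```

In the script itself the filtered lines are collected with `mapfile -t`, which keeps each service name as one array element even if it were to contain spaces.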
|
||||
|
|
|
|||
55
scripts/deploy/deploy-stability-agent.sh
Executable file
@@ -0,0 +1,55 @@
#!/usr/bin/env bash
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)

NODE=$1
MODE="print"
[[ "$2" == "--ssh" ]] && MODE="ssh"

if [[ -z "$NODE" ]]; then
    echo "Usage: $0 <node-name> [--ssh]"
    echo "Supported nodes: chelsty, piha, solaria, vps"
    exit 1
fi

case "$NODE" in
    piha) TARGET="100.108.208.3" ;;
    chelsty) TARGET="100.122.201.22" ;;
    vps) TARGET="100.95.58.48" ;;
    solaria) TARGET="local" ;;
    *)
        echo "Error: Unknown node '$NODE'"
        echo "Supported nodes: chelsty, piha, solaria, vps"
        exit 1
        ;;
esac

echo "HOST: $NODE"
echo "MODE: $MODE"
echo "TARGET: $TARGET"

REPO_PATH="/home/oskar/homelab-codex-ws"

if [[ "$NODE" == "solaria" ]]; then
    if [[ "$MODE" == "ssh" ]]; then
        echo "--- Running local deployment for solaria ---"
        cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
    else
        echo "# --- Deployment commands for solaria ---"
        echo "cd $REPO_PATH"
        echo "git fetch origin"
        echo "git checkout master"
        echo "git pull origin master"
        echo "cd services/stability-agent"
        echo "./deploy-local.sh solaria"
    fi
else
    # Remote nodes
    SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
    if [[ "$MODE" == "ssh" ]]; then
        echo "--- Deploying to $NODE ($TARGET) via SSH ---"
        eval "$SSH_CMD"
    else
        echo "# --- Deployment commands for $NODE ---"
        echo "$SSH_CMD"
    fi
fi
|
|
@ -1,270 +1,321 @@
|
|||
#!/usr/bin/env bash
|
||||
# deploy.sh - Staged deployment framework for homelab nodes.
|
||||
# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
|
||||
# Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
||||
# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
|
||||
# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
||||
|
||||
set -o pipefail
|
||||
set -uo pipefail
|
||||
|
||||
# --- Configuration ---
|
||||
export RUNTIME_PATH="/opt/homelab"
|
||||
export STATE_DIR="${RUNTIME_PATH}/state/deploy"
|
||||
export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
|
||||
export REPO_PATH="${HOME}/homelab-codex-ws"
|
||||
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
|
||||
export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
|
||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||
SSH_USER="${SSH_USER:-oskar}"
|
||||
START_TIME=$(date +%s)
|
||||
TARGET=""
|
||||
DRY_RUN=false
|
||||
NO_GATE=false
|
||||
|
||||
# --- Initialization ---
|
||||
mkdir -p "$STATE_DIR" "$LOG_DIR"
|
||||
usage() {
|
||||
cat >&2 <<'EOF'
|
||||
Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
||||
|
||||
# Redirection for logging
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
Targets:
|
||||
control-plane observer/supervisor/executor/operator-ui on VPS
|
||||
vps all VPS GitOps services
|
||||
piha PIHA services
|
||||
solaria SOLARIA compute services
|
||||
chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)
|
||||
|
||||
# --- Load Libraries ---
|
||||
LIB_PATH="${REPO_PATH}/scripts/lib"
|
||||
source "${LIB_PATH}/log.sh"
|
||||
source "${LIB_PATH}/state.sh"
|
||||
source "${LIB_PATH}/inventory.sh"
|
||||
source "${LIB_PATH}/compose.sh"
|
||||
source "${LIB_PATH}/diagnostics.sh"
|
||||
Flags:
|
||||
--dry-run run preflight + gate only; stop before deploy
|
||||
--no-gate skip pytest + docker build (emergency only; logged as WARNING)
|
||||
|
||||
# --- CLI Parsing ---
|
||||
TARGET_HOST=$(hostname)
|
||||
TARGET_SERVICE=""
|
||||
RESUME=false
|
||||
REQUESTED_STAGE=""
|
||||
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
||||
EOF
|
||||
exit 1
|
||||
}
|
||||
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case $1 in
|
||||
--host)
|
||||
TARGET_HOST="$2"
|
||||
shift 2
|
||||
;;
|
||||
--service)
|
||||
TARGET_SERVICE="$2"
|
||||
shift 2
|
||||
;;
|
||||
--resume)
|
||||
RESUME=true
|
||||
shift
|
||||
;;
|
||||
--stage)
|
||||
REQUESTED_STAGE="$2"
|
||||
shift 2
|
||||
;;
|
||||
control-plane|vps|piha|solaria|chelsty-infra)
|
||||
TARGET="$1"; shift ;;
|
||||
--dry-run)
|
||||
DRY_RUN=true; shift ;;
|
||||
--no-gate)
|
||||
NO_GATE=true; shift ;;
|
||||
-h|--help)
|
||||
usage ;;
|
||||
*)
|
||||
if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
|
||||
REQUESTED_STAGE="$1"
|
||||
fi
|
||||
shift
|
||||
;;
|
||||
echo "Unknown argument: $1" >&2
|
||||
usage ;;
|
||||
esac
|
||||
done
|
||||
|
||||
# --- Stages ---
|
||||
[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
|
||||
|
||||
stage_prepare() {
|
||||
local host=$1
|
||||
if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping PREPARE (already complete)"
|
||||
case "$TARGET" in
|
||||
control-plane) SSH_HOST="vps" ;;
|
||||
*) SSH_HOST="$TARGET" ;;
|
||||
esac
|
||||
|
||||
case "$TARGET" in
|
||||
chelsty-*) SSH_TIMEOUT=30 ;;
|
||||
*) SSH_TIMEOUT=5 ;;
|
||||
esac
|
||||
|
||||
# ── PREFLIGHT ────────────────────────────────────────────────────────────────
|
||||
|
||||
preflight() {
|
||||
echo "=== PREFLIGHT ==="
|
||||
|
||||
local branch
|
||||
branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
|
||||
if [[ "$branch" != "master" ]]; then
|
||||
echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[ok] branch: master"
|
||||
|
||||
if ! git -C "$REPO_ROOT" diff --quiet; then
|
||||
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
|
||||
exit 1
|
||||
fi
|
||||
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
|
||||
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[ok] working tree clean"
|
||||
|
||||
git -C "$REPO_ROOT" fetch origin master --quiet
|
||||
local unpushed
|
||||
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
|
||||
if [[ -n "$unpushed" ]]; then
|
||||
echo "ERROR: Unpushed commits on master:" >&2
|
||||
echo "$unpushed" >&2
|
||||
echo "Push first: git push origin master" >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[ok] no unpushed commits"
|
||||
|
||||
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
|
||||
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
|
||||
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
|
||||
exit 1
|
||||
fi
|
||||
echo "[ok] ${SSH_HOST} reachable"
|
||||
}
|
||||
|
||||
# ── GATE ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
gate() {
|
||||
if [[ "$NO_GATE" == "true" ]]; then
|
||||
echo "=== GATE: SKIPPED ==="
|
||||
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "INFO" "Stage: PREPARE ($host)"
|
||||
set_stage "prepare"
|
||||
|
||||
emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
|
||||
echo "=== GATE ==="
|
||||
|
||||
cd "$REPO_PATH" || exit 1
|
||||
log "INFO" "Pulling latest changes..."
|
||||
if ! git pull; then
|
||||
log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
|
||||
fi
|
||||
local services=()
|
||||
|
||||
# Ensure runtime directories exist
|
||||
mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
|
||||
|
||||
struct_log "prepare" "$host" "all" "success" "repo_updated"
|
||||
mark_stage_complete "prepare"
|
||||
}
|
||||
|
||||
stage_validate() {
|
||||
local host=$1
|
||||
if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping VALIDATE (already complete)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "INFO" "Stage: VALIDATE ($host)"
|
||||
set_stage "validate"
|
||||
|
||||
for service in "${SERVICES[@]}"; do
|
||||
log "INFO" "Validating $service..."
|
||||
if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
|
||||
log "ERROR" "Service definition not found: $service"
|
||||
struct_log "validate" "$host" "$service" "fail" "not_found"
|
||||
return 1
|
||||
fi
|
||||
done
|
||||
|
||||
struct_log "validate" "$host" "all" "success" "validated"
|
||||
mark_stage_complete "validate"
|
||||
}
|
||||
|
||||
stage_deploy() {
|
||||
local host=$1
|
||||
if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping DEPLOY (already complete)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "INFO" "Stage: DEPLOY ($host)"
|
||||
set_stage "deploy"
|
||||
|
||||
local last_s=$(get_last_service)
|
||||
local skip=false
|
||||
if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
|
||||
skip=true
|
||||
fi
|
||||
|
||||
for service in "${SERVICES[@]}"; do
|
||||
if [[ "$skip" == "true" ]]; then
|
||||
if [[ "$service" == "$last_s" ]]; then
|
||||
skip=false
|
||||
log "INFO" "Resuming from $service..."
|
||||
else
|
||||
log "INFO" "Skipping $service (already processed)"
|
||||
continue
|
||||
fi
|
||||
fi
|
||||
|
||||
log "INFO" "Deploying $service..."
|
||||
set_last_service "$service"
|
||||
|
||||
if ! run_compose_up "$service"; then
|
||||
struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
|
||||
collect_diagnostics "$host" "$service"
|
||||
return 1
|
||||
fi
|
||||
|
||||
struct_log "deploy" "$host" "$service" "success" "deployed"
|
||||
done
|
||||
|
||||
set_last_service ""
|
||||
mark_stage_complete "deploy"
|
||||
}
|
||||
|
||||
stage_verify() {
|
||||
local host=$1
|
||||
if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
|
||||
log "INFO" "Skipping VERIFY (already complete)"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "INFO" "Stage: VERIFY ($host)"
|
||||
set_stage "verify"
|
||||
|
||||
for service in "${SERVICES[@]}"; do
|
||||
log "INFO" "Verifying $service..."
|
||||
local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
|
||||
if [[ -f "$health_script" ]]; then
|
||||
if ! bash "$health_script"; then
|
||||
log "ERROR" "Healthcheck failed for $service"
|
||||
struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
|
||||
collect_diagnostics "$host" "$service"
|
||||
return 1
|
||||
fi
|
||||
else
|
||||
# Generic check if container is running
|
||||
if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
|
||||
log "ERROR" "Container $service is not running"
|
||||
struct_log "verify" "$host" "$service" "fail" "container_not_running"
|
||||
collect_diagnostics "$host" "$service"
|
||||
return 1
|
||||
fi
|
||||
fi
|
||||
struct_log "verify" "$host" "$service" "success" "verified"
|
||||
done
|
||||
mark_stage_complete "verify"
|
||||
}
|
||||
|
||||
stage_complete() {
|
||||
local host=$1
|
||||
log "INFO" "Stage: COMPLETE ($host)"
|
||||
set_stage "complete"
|
||||
struct_log "complete" "$host" "all" "success" "deployment_finished"
|
||||
clear_deployment_state
|
||||
}
|
||||
|
||||
# --- Execution Logic ---
|
||||
|
||||
run_deployment() {
|
||||
local start_stage=$1
|
||||
|
||||
# Sequential execution from start_stage
|
||||
case "$start_stage" in
|
||||
prepare)
|
||||
stage_prepare "$TARGET_HOST" || return 1
|
||||
;&
|
||||
validate)
|
||||
stage_validate "$TARGET_HOST" || return 1
|
||||
;&
|
||||
deploy)
|
||||
stage_deploy "$TARGET_HOST" || return 1
|
||||
;&
|
||||
verify)
|
||||
stage_verify "$TARGET_HOST" || return 1
|
||||
;&
|
||||
complete)
|
||||
stage_complete "$TARGET_HOST" || return 1
|
||||
;;
|
||||
*)
|
||||
log "ERROR" "Invalid stage: $start_stage"
|
||||
return 1
|
||||
;;
|
||||
esac
|
||||
}
|
||||
|
||||
# --- Main ---
|
||||
|
||||
log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
|
||||
|
||||
if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
|
||||
log "ERROR" "Failed to load inventory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
EXIT_STATUS=0
|
||||
if [[ "$RESUME" == "true" ]]; then
|
||||
CURRENT=$(get_stage)
|
||||
log "INFO" "Resuming from state: $CURRENT"
|
||||
case "$CURRENT" in
|
||||
prepare|validate|deploy|verify)
|
||||
run_deployment "$CURRENT" || EXIT_STATUS=1
|
||||
;;
|
||||
complete|none)
|
||||
log "INFO" "No interrupted deployment found. Starting from scratch..."
|
||||
run_deployment "prepare" || EXIT_STATUS=1
|
||||
;;
|
||||
*)
|
||||
log "INFO" "Unknown state. Starting from prepare..."
|
||||
run_deployment "prepare" || EXIT_STATUS=1
|
||||
;;
|
||||
esac
|
||||
elif [[ -n "$REQUESTED_STAGE" ]]; then
|
||||
if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
|
||||
collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
services=("control-plane")
|
||||
else
|
||||
run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
|
||||
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
|
||||
if [[ ! -f "$svc_yaml" ]]; then
|
||||
echo "ERROR: ${svc_yaml} not found." >&2
|
||||
exit 2
|
||||
fi
|
||||
local svc_list
|
||||
svc_list=$(python3 -c "
|
||||
import yaml
|
||||
with open('${svc_yaml}') as f:
|
||||
data = yaml.safe_load(f)
|
||||
svcs = data.get('services', {})
|
||||
if isinstance(svcs, dict):
|
||||
print('\n'.join(svcs.keys()))
|
||||
elif isinstance(svcs, list):
|
||||
print('\n'.join(svcs))
|
||||
")
|
||||
while IFS= read -r svc; do
|
||||
[[ -z "$svc" ]] && continue
|
||||
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
|
||||
services+=("$svc")
|
||||
fi
|
||||
done <<< "$svc_list"
|
||||
fi
|
||||
else
|
||||
# New deployment - clear previous state
|
||||
clear_deployment_state
|
||||
run_deployment "prepare" || EXIT_STATUS=1
|
||||
|
||||
if [[ ${#services[@]} -eq 0 ]]; then
|
||||
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
|
||||
return 0
|
||||
fi
|
||||
|
||||
echo "Services under gate: ${services[*]}"
|
||||
local gate_failed=false
|
||||
|
||||
for svc in "${services[@]}"; do
|
||||
local svc_dir="${REPO_ROOT}/services/${svc}"
|
||||
|
||||
if [[ -d "${svc_dir}/tests" ]]; then
|
||||
echo "--- pytest: ${svc} ---"
|
||||
if ! python3 -m pytest "${svc_dir}/tests" -q; then
|
||||
echo "GATE FAIL: pytest failed for ${svc}" >&2
|
||||
gate_failed=true
|
||||
fi
|
||||
fi
|
||||
|
||||
echo "--- docker build: ${svc} ---"
|
||||
if ! docker build --quiet "${svc_dir}" >/dev/null; then
|
||||
echo "GATE FAIL: docker build failed for ${svc}" >&2
|
||||
gate_failed=true
|
||||
fi
|
||||
done
|
||||
|
||||
if [[ "$gate_failed" == "true" ]]; then
|
||||
exit 2
|
||||
fi
|
||||
echo "[ok] gate passed"
|
||||
}
|
||||
|
||||
# ── EXECUTE ──────────────────────────────────────────────────────────────────
|
||||
|
||||
execute() {
|
||||
echo "=== EXECUTE ==="
|
||||
|
||||
local cmd_output
|
||||
local cmd_exit=0
|
||||
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
echo "Running deploy-control-plane.sh --ssh..."
|
||||
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|
||||
|| cmd_exit=$?
|
||||
else
|
||||
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
|
||||
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||
"${SSH_USER}@${SSH_HOST}" \
|
||||
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|
||||
|| cmd_exit=$?
|
||||
fi
|
||||
|
||||
echo "$cmd_output"
|
||||
|
||||
if echo "$cmd_output" | grep -qF "[sudo] password"; then
|
||||
echo "" >&2
|
||||
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
|
||||
echo "Run manually:" >&2
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
|
||||
else
|
||||
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
|
||||
fi
|
||||
exit 5
|
||||
fi
|
||||
|
||||
if [[ $cmd_exit -ne 0 ]]; then
|
||||
echo "ERROR: Deploy command exited ${cmd_exit}." >&2
|
||||
exit 3
|
||||
fi
|
||||
|
||||
echo "[ok] execute completed"
|
||||
}
|
||||
|
||||
# ── VERIFY ───────────────────────────────────────────────────────────────────
|
||||
|
||||
verify() {
|
||||
echo "=== VERIFY ==="
|
||||
|
||||
local ps_output
|
||||
local ps_exit=0
|
||||
ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||
"${SSH_USER}@${SSH_HOST}" \
|
||||
'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
|
||||
|| ps_exit=$?
|
||||
|
||||
if [[ $ps_exit -ne 0 ]]; then
|
||||
echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
|
||||
echo "$ps_output" >&2
|
||||
exit 4
|
||||
fi
|
||||
|
||||
echo "$ps_output"
|
||||
|
||||
local failed=false
|
||||
|
||||
local not_up
|
||||
not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
|
||||
if [[ -n "$not_up" ]]; then
|
||||
echo "ERROR: Containers not in Up state:" >&2
|
||||
echo "$not_up" >&2
|
||||
failed=true
|
||||
fi
|
||||
|
||||
local unhealthy
|
||||
unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
|
||||
if [[ -n "$unhealthy" ]]; then
|
||||
echo "ERROR: Unhealthy containers:" >&2
|
||||
echo "$unhealthy" >&2
|
||||
failed=true
|
||||
fi
|
||||
|
||||
if [[ "$TARGET" == "control-plane" ]]; then
|
||||
for cp_svc in supervisor observer executor operator-ui; do
|
||||
if ! echo "$ps_output" | grep -q "$cp_svc"; then
|
||||
echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
|
||||
failed=true
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
if [[ "$failed" == "true" ]]; then
|
||||
echo "" >&2
|
||||
echo "Full docker ps output above." >&2
|
||||
exit 4
|
||||
fi
|
||||
|
||||
echo "[ok] all containers healthy"
|
||||
}
|
||||
|
||||
# ── REPORT ───────────────────────────────────────────────────────────────────
|
||||
|
||||
report() {
|
||||
local mode="${1:-deploy}"
|
||||
local end_time
|
||||
end_time=$(date +%s)
|
||||
local elapsed
|
||||
elapsed=$(( end_time - START_TIME ))
|
||||
local commit_hash
|
||||
commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
|
||||
local gate_s verify_s
|
||||
|
||||
if [[ "$NO_GATE" == "true" ]]; then
|
||||
gate_s="skip"
|
||||
else
|
||||
gate_s="ok"
|
||||
fi
|
||||
|
||||
if [[ "$mode" == "dry-run" ]]; then
|
||||
verify_s="skip(dry-run)"
|
||||
else
|
||||
verify_s="green"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
if [[ "$mode" == "dry-run" ]]; then
|
||||
echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
||||
else
|
||||
echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
||||
fi
|
||||
}
|
||||
|
||||
# ── MAIN ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
preflight
|
||||
gate
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
report dry-run
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ $EXIT_STATUS -eq 0 ]]; then
|
||||
print_summary "$TARGET_HOST" "SUCCESS"
|
||||
log "INFO" "--- Homelab Deployment Finished Successfully ---"
|
||||
else
|
||||
print_summary "$TARGET_HOST" "FAILED"
|
||||
log "ERROR" "--- Homelab Deployment Failed ---"
|
||||
exit 1
|
||||
fi
|
||||
execute
|
||||
verify
|
||||
report
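Note on the VERIFY stage above: its container checks can be exercised without SSH or Docker. This is a minimal sketch, assuming a hypothetical function name `check_ps_output` and hand-written sample input in the `{{.Names}}\t{{.Status}}` format. It applies the same two filters as verify() (status not starting with "Up", and "(unhealthy)") but returns a status instead of exiting.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): classify tab-separated
# "name<TAB>status" lines as produced by
#   docker ps --format "{{.Names}}\t{{.Status}}"
# Prints offending lines to stderr; returns 1 if any container is not Up
# or is flagged (unhealthy), 0 otherwise.
check_ps_output() {
    local ps_output="$1"
    local tab not_up unhealthy
    tab=$(printf '\t')
    # Lines whose status column does not begin with "Up".
    not_up=$(printf '%s\n' "$ps_output" | grep -v '^$' | grep -v "${tab}Up" || true)
    # Lines Docker reports as failing their healthcheck.
    unhealthy=$(printf '%s\n' "$ps_output" | grep -F '(unhealthy)' || true)
    if [ -n "$not_up" ] || [ -n "$unhealthy" ]; then
        printf '%s\n' "$not_up" "$unhealthy" | grep -v '^$' >&2
        return 1
    fi
    return 0
}

# Hand-written sample standing in for real docker ps output.
sample="$(printf 'supervisor\tUp 3 minutes (healthy)\nexecutor\tRestarting (1) 4 seconds ago')"
if check_ps_output "$sample"; then
    echo "all containers up"
else
    echo "verify would exit 4"
fi
```

Keeping the classification in a function that takes the text as an argument is one way to test the exit-code-4 path against canned output before pointing it at a live host.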
|
||||
|
|
|
|||
|
|
@ -1,15 +1,30 @@
|
|||
#!/usr/bin/env bash
|
||||
# orchestrate-deploy.sh - To be run on SATURN
|
||||
# Triggers deployment on remote execution nodes.
|
||||
# Triggers deployment on remote execution nodes via inventory.
|
||||
|
||||
set -e
|
||||
|
||||
HOSTS=("solaria" "piha" "vps")
|
||||
USER="oskar" # Default user
|
||||
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||
USER="oskar"
|
||||
|
||||
for HOST in "${HOSTS[@]}"; do
|
||||
while IFS=' ' read -r HOST TAG; do
|
||||
echo ">>> Triggering deployment on ${HOST}..."
|
||||
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
||||
done
|
||||
if [[ "$TAG" == "lte" ]]; then
|
||||
ssh -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
|
||||
echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
|
||||
else
|
||||
ssh -n "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
||||
fi
|
||||
done < <(python3 -c "
|
||||
import yaml, sys
with open('${REPO_PATH}/inventory/topology.yaml') as f:
    data = yaml.safe_load(f)
skip = {'saturn', 'solaria'}
for name, info in (data.get('nodes') or {}).items():
    if name in skip:
        continue
    uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
    print(name, 'lte' if uplink == 'lte' else 'standard')
|
||||
")
|
||||
|
||||
echo ">>> All deployments triggered."
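The inline Python in orchestrate-deploy.sh reduces to a small selection rule: skip the named nodes, then tag every remaining node by its uplink. A minimal standalone sketch of that rule, assuming the topology has already been parsed into a dict; the function name `deploy_targets` and the sample data are illustrative and not part of the repository:

```python
from typing import Iterable, Iterator, Optional, Tuple


def deploy_targets(
    topology: Optional[dict],
    skip: Iterable[str] = ("saturn", "solaria"),
) -> Iterator[Tuple[str, str]]:
    """Yield (node, tag) pairs; tag is 'lte' for LTE uplinks, else 'standard'."""
    skipped = set(skip)
    # Tolerate an empty file or a null 'nodes' key, as the inline script does.
    for name, info in ((topology or {}).get("nodes") or {}).items():
        if name in skipped:
            continue
        uplink = ((info or {}).get("connectivity") or {}).get("uplink", "")
        yield name, ("lte" if uplink == "lte" else "standard")


# Hypothetical topology, for illustration only.
sample = {
    "nodes": {
        "saturn": {"connectivity": {"uplink": "fiber"}},
        "piha": None,
        "chelsty": {"connectivity": {"uplink": "lte"}},
        "vps": {"connectivity": {}},
    }
}
targets = list(deploy_targets(sample))
print(targets)
```

Keeping the rule in a function like this would make it testable without SSH access; the shell loop only needs the two whitespace-separated fields per line.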
|
||||
|
|
|
|||
68
scripts/deploy/verify-agent-fleet.sh
Executable file
|
|
@ -0,0 +1,68 @@
|
|||
#!/usr/bin/env bash
|
||||
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
|
||||
|
||||
REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
|
||||
|
||||
# Check if docker is available
|
||||
if ! command -v docker &> /dev/null; then
|
||||
echo "Error: docker command not found."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Check if container is running
|
||||
if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
|
||||
echo "Error: agent-system-redis container not found or not running."
|
||||
echo "This script must be run on PIHA (the node hosting the Redis container)."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
|
||||
MISSING_NODES=0
|
||||
|
||||
echo "--- Homelab Agent Fleet Status ---"
|
||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
|
||||
printf "%s\n" "--------------------------------------------------------------------------------"
|
||||
|
||||
for NODE in "${REQUIRED_NODES[@]}"; do
|
||||
KEY="homelab:nodes:$NODE"
|
||||
|
||||
# Check if key exists
|
||||
EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
|
||||
|
||||
if [[ "$EXISTS" != "1" ]]; then
|
||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
|
||||
MISSING_NODES=$((MISSING_NODES + 1))
|
||||
continue
|
||||
fi
|
||||
|
||||
HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
|
||||
HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
|
||||
STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
|
||||
LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
|
||||
|
||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
|
||||
done
|
||||
|
||||
echo ""
|
||||
echo "--- Control Plane Summary ---"
|
||||
if command -v jq >/dev/null; then
|
||||
curl -s http://127.0.0.1:18180/summary | jq .
|
||||
else
|
||||
curl -s http://127.0.0.1:18180/summary
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "--- Control Plane Nodes ---"
|
||||
if command -v jq >/dev/null; then
|
||||
curl -s http://127.0.0.1:18180/nodes | jq .
|
||||
else
|
||||
curl -s http://127.0.0.1:18180/nodes
|
||||
fi
|
||||
|
||||
if [[ $MISSING_NODES -gt 0 ]]; then
|
||||
echo ""
|
||||
echo "Error: $MISSING_NODES required nodes are missing from Redis."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
exit 0
|
||||
361
scripts/dev/agent.sh
Executable file
|
|
@ -0,0 +1,361 @@
|
|||
#!/usr/bin/env bash
|
||||
# Multi-agent worktree manager.
|
||||
# EXIT: 0 ok, 1 preflight, 2 operation failed.
|
||||
set -euo pipefail
|
||||
|
||||
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
|
||||
|
||||
RESERVED_NAMES=(master main HEAD list merge clean new)
|
||||
MAX_WORKTREES=4
|
||||
|
||||
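# die MESSAGE [EXIT_CODE]: print only the message; the optional second argument is the exit status (default 2).
die() { echo "ERROR: $1" >&2; exit "${2:-2}"; }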
|
||||
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
|
||||
|
||||
# ── helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
is_main_checkout() {
|
||||
local git_dir common_dir
|
||||
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
|
||||
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
|
||||
[ "$git_dir" = "$common_dir" ]
|
||||
}
|
||||
|
||||
require_main_checkout() {
|
||||
is_main_checkout || prefail "must run from the main checkout, not a worktree"
|
||||
}
|
||||
|
||||
require_master_branch() {
|
||||
local branch
|
||||
branch=$(git rev-parse --abbrev-ref HEAD)
|
||||
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
|
||||
}
|
||||
|
||||
require_clean_tree() {
|
||||
local dirty
|
||||
dirty=$(git status --porcelain)
|
||||
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
|
||||
}
|
||||
|
||||
worktree_paths() {
|
||||
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
|
||||
local main_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
git worktree list --porcelain \
|
||||
| awk '/^worktree /{p=$2} /^$/{print p}' \
|
||||
| grep -vxF -- "${main_path}" \
|
||||
|| true
|
||||
}
|
||||
|
||||
worktree_count() {
|
||||
worktree_paths | wc -l
|
||||
}
|
||||
|
||||
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
|
||||
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
|
||||
|
||||
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
|
||||
|
||||
age_str() {
|
||||
local created_utc="$1"
|
||||
local now_ts created_ts diff_s
|
||||
now_ts=$(date -u +%s)
|
||||
# replace T with a space for GNU `date -d`; the trailing Z is kept and parsed as UTC
|
||||
created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
|
||||
diff_s=$(( now_ts - created_ts ))
|
||||
if (( diff_s < 60 )); then echo "${diff_s}s"
|
||||
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
|
||||
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
|
||||
else echo "$(( diff_s/86400 ))d"
|
||||
fi
|
||||
}
|
||||
|
||||
validate_name() {
|
||||
local name="$1"
|
||||
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
|
||||
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
|
||||
fi
|
||||
for r in "${RESERVED_NAMES[@]}"; do
|
||||
if [ "$name" = "$r" ]; then
|
||||
prefail "'$name' is a reserved word"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
# ── subcommands ───────────────────────────────────────────────────────────────
|
||||
|
||||
cmd_new() {
|
||||
local name="${1:-}"
|
||||
[ -n "$name" ] || { usage; exit 1; }
|
||||
|
||||
validate_name "$name"
|
||||
require_main_checkout
|
||||
require_master_branch
|
||||
require_clean_tree
|
||||
|
||||
# worktree limit
|
||||
local count
|
||||
count=$(worktree_count)
|
||||
if (( count >= MAX_WORKTREES )); then
|
||||
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
|
||||
cmd_list
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# branch collision
|
||||
if branch_exists_local "task/$name"; then
|
||||
prefail "branch task/$name already exists locally"
|
||||
fi
|
||||
git fetch origin master --quiet
|
||||
if branch_exists_remote "refs/heads/task/$name"; then
|
||||
prefail "branch task/$name already exists on origin"
|
||||
fi
|
||||
|
||||
# directory collision
|
||||
local main_path wt_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
||||
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
|
||||
|
||||
# create worktree
|
||||
git worktree add -b "task/$name" "$wt_path" origin/master \
|
||||
|| die "git worktree add failed"
|
||||
|
||||
# write marker
|
||||
local parent_commit
|
||||
parent_commit=$(git rev-parse origin/master)
|
||||
cat > "$wt_path/.agent-task" <<EOF
|
||||
task: $name
|
||||
branch: task/$name
|
||||
parent_commit: $parent_commit
|
||||
created_utc: $(utc_now)
|
||||
worktree_path: $wt_path
|
||||
EOF
|
||||
|
||||
echo ""
|
||||
echo "Worktree created: $wt_path"
|
||||
echo "Branch: task/$name"
|
||||
echo ""
|
||||
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
|
||||
echo "cd ${wt_path} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
|
||||
echo "─────────────────────────────────────────────────────────────────────────────"
|
||||
}
|
||||
|
||||
cmd_list() {
|
||||
local main_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
|
||||
# fetch to get up-to-date ahead/behind
|
||||
git fetch origin master --quiet 2>/dev/null || true
|
||||
|
||||
local paths
|
||||
paths=$(worktree_paths)
|
||||
|
||||
if [ -z "$paths" ]; then
|
||||
echo "(no active task worktrees)"
|
||||
return
|
||||
fi
|
||||
|
||||
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
||||
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
|
||||
|
||||
while IFS= read -r wt_path; do
|
||||
[ -z "$wt_path" ] && continue
|
||||
|
||||
local marker="$wt_path/.agent-task"
|
||||
local task_name branch parent_commit created_utc
|
||||
if [ -f "$marker" ]; then
|
||||
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
|
||||
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
|
||||
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
|
||||
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
|
||||
else
|
||||
task_name="(no marker)"
|
||||
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
|
||||
parent_commit="?"
|
||||
created_utc=""
|
||||
fi
|
||||
|
||||
local status="clean"
|
||||
local dirty
|
||||
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
|
||||
[ -n "$dirty" ] && status="dirty"
|
||||
|
||||
local ahead behind ab
|
||||
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
|
||||
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
|
||||
ab="+${ahead}/-${behind}"
|
||||
|
||||
local age=""
|
||||
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
|
||||
|
||||
local short_parent="${parent_commit:0:7}"
|
||||
local short_created="${created_utc:0:10}"
|
||||
|
||||
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
||||
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
|
||||
done <<< "$paths"
|
||||
}
|
||||
|
||||
cmd_merge() {
|
||||
local name="${1:-}"
|
||||
[ -n "$name" ] || { usage; exit 1; }
|
||||
|
||||
require_main_checkout
|
||||
require_master_branch
|
||||
require_clean_tree
|
||||
|
||||
git fetch origin --quiet
|
||||
|
||||
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
|
||||
|
||||
local main_path wt_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
||||
|
||||
# attempt ff-only merge
|
||||
local merge_failed=0
|
||||
git merge --ff-only "task/$name" || merge_failed=1
|
||||
|
||||
if (( merge_failed )); then
|
||||
# abort any partial merge state
|
||||
git merge --abort 2>/dev/null || true
|
||||
echo ""
|
||||
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
|
||||
echo " The branch has likely diverged from master." >&2
|
||||
echo "" >&2
|
||||
echo "Diagnose with:" >&2
|
||||
echo " git log master..task/$name # commits only on task branch" >&2
|
||||
echo " git log task/$name..master # commits master has that task doesn't" >&2
|
||||
echo "" >&2
|
||||
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
|
||||
echo "Worktree and branch are preserved — no changes made." >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
echo "Merged task/$name into master (fast-forward)."
|
||||
|
||||
git push origin master || die "git push origin master failed"
|
||||
echo "Pushed master to origin."
|
||||
|
||||
if [ -d "$wt_path" ]; then
|
||||
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
|
||||
echo "Removed worktree: $wt_path"
|
||||
else
|
||||
echo "(worktree directory $wt_path not found — skipping worktree remove)"
|
||||
fi
|
||||
|
||||
git branch -d "task/$name" || die "git branch -d task/$name failed"
|
||||
echo "Deleted local branch task/$name."
|
||||
|
||||
git push origin --delete "task/$name" 2>/dev/null \
|
||||
&& echo "Deleted remote branch task/$name." \
|
||||
|| echo "(remote branch task/$name not found — nothing to delete)"
|
||||
|
||||
echo ""
|
||||
echo "Done. task/$name merged and cleaned up."
|
||||
}
|
||||
|
||||
cmd_clean() {
|
||||
local main_path
|
||||
main_path=$(git rev-parse --show-toplevel)
|
||||
git fetch origin --quiet 2>/dev/null || true
|
||||
|
||||
local to_remove=()
|
||||
|
||||
# orphaned registered worktrees: branch deleted or fully merged into master
|
||||
local paths
|
||||
paths=$(worktree_paths)
|
||||
while IFS= read -r wt_path; do
|
||||
[ -z "$wt_path" ] && continue
|
||||
local branch
|
||||
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
|
||||
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
|
||||
|
||||
# branch gone locally?
|
||||
if ! branch_exists_local "$branch"; then
|
||||
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
|
||||
continue
|
||||
fi
|
||||
|
||||
# branch fully merged into master?
|
||||
local ahead
|
||||
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
|
||||
if [ "$ahead" = "0" ]; then
|
||||
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
|
||||
fi
|
||||
done <<< "$paths"
|
||||
|
||||
# dangling directories: ../homelab-codex-ws-* not registered
|
||||
local registered_paths
|
||||
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
|
||||
local parent_dir
|
||||
parent_dir=$(dirname "$main_path")
|
||||
while IFS= read -r candidate; do
|
||||
[ -d "$candidate" ] || continue
|
||||
if ! echo "$registered_paths" | grep -qxF -- "$candidate"; then
|
||||
to_remove+=("dangling:$candidate")
|
||||
fi
|
||||
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
|
||||
|
||||
if [ ${#to_remove[@]} -eq 0 ]; then
|
||||
echo "Nothing to clean."
|
||||
return 0
|
||||
fi
|
||||
|
||||
echo "Found ${#to_remove[@]} item(s) to clean:"
|
||||
for entry in "${to_remove[@]}"; do
|
||||
echo " $entry"
|
||||
done
|
||||
echo ""
|
||||
|
||||
local overall_rc=0
|
||||
for entry in "${to_remove[@]}"; do
|
||||
local kind="${entry%%:*}"
|
||||
local path="${entry#*:}"
|
||||
# strip trailing annotation in parens
|
||||
local raw_path
|
||||
raw_path="${path%% (*}"
|
||||
|
||||
local confirm
|
||||
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
|
||||
if [[ "$confirm" =~ ^[Yy]$ ]]; then
|
||||
if [ "$kind" = "worktree" ]; then
|
||||
git worktree remove --force "$raw_path" 2>/dev/null \
|
||||
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
|
||||
else
|
||||
rm -rf "$raw_path"
|
||||
fi
|
||||
echo " Removed."
|
||||
else
|
||||
echo " Skipped."
|
||||
fi
|
||||
done
|
||||
|
||||
return $overall_rc
|
||||
}
|
||||
|
||||
usage() {
|
||||
cat <<'EOF'
|
||||
Usage: agent.sh <subcommand> [args]
|
||||
|
||||
agent.sh new <name> Create a new task worktree (branch task/<name>)
|
||||
agent.sh list List active task worktrees with status
|
||||
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
|
||||
agent.sh clean Remove orphaned or dangling worktrees (interactive)
|
||||
|
||||
EXIT: 0 ok, 1 preflight, 2 operation failed.
|
||||
EOF
|
||||
}
|
||||
|
||||
# ── dispatch ──────────────────────────────────────────────────────────────────
|
||||
|
||||
SUBCOMMAND="${1:-}"
|
||||
shift || true
|
||||
|
||||
case "$SUBCOMMAND" in
|
||||
new) cmd_new "$@" ;;
|
||||
list) cmd_list "$@" ;;
|
||||
merge) cmd_merge "$@" ;;
|
||||
clean) cmd_clean "$@" ;;
|
||||
*) usage; exit 1 ;;
|
||||
esac
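The AGE column printed by `agent.sh list` comes from a simple bucketing of elapsed seconds. A minimal Python sketch of the same rule, assuming `created_utc` uses the `YYYY-MM-DDTHH:MM:SSZ` format written to `.agent-task`; the helper `age_from_iso` is a hypothetical name introduced here:

```python
from datetime import datetime, timezone
from typing import Optional


def age_str(diff_s: int) -> str:
    """Format an age in seconds with the same buckets as the shell helper."""
    if diff_s < 60:
        return f"{diff_s}s"
    if diff_s < 3600:
        return f"{diff_s // 60}m"
    if diff_s < 86400:
        return f"{diff_s // 3600}h"
    return f"{diff_s // 86400}d"


def age_from_iso(created_utc: str, now: Optional[datetime] = None) -> str:
    """Return the age of a UTC timestamp, or '?' when it cannot be parsed."""
    try:
        created = datetime.strptime(created_utc, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return "?"
    created = created.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return age_str(int((now - created).total_seconds()))


reference = datetime(2026, 6, 3, 12, 30, tzinfo=timezone.utc)
print(age_from_iso("2026-06-03T10:00:00Z", now=reference))
```

Integer division truncates, so an age of two and a half hours is reported in whole hours, matching the shell arithmetic.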
|
||||
338
scripts/monitor/health-monitor.sh
Executable file
|
|
@ -0,0 +1,338 @@
|
|||
#!/usr/bin/env bash
|
||||
# health-monitor.sh - Homelab node health monitor and safe disk cleanup
|
||||
#
|
||||
# Designed to run standalone on the host (cron or direct) or to be called by
|
||||
# the node-agent Python daemon. All cleanup decisions follow the conservative
|
||||
# policy agreed in the design review:
|
||||
#
|
||||
# lte_node (chelsty-infra, chelsty-ha) : NO cleanup at all
|
||||
# sd_card (piha, saturn) : dangling images + stopped containers,
|
||||
# rate-limited to once per 24 h
|
||||
# ai_node (solaria) : dangling images + stopped containers
|
||||
# + build cache (NEVER -a)
|
||||
# standard (vps) : dangling images + stopped containers
|
||||
# + build cache
|
||||
#
|
||||
# VPS additionally rotates control-plane filesystem artefacts:
|
||||
# actions/completed + failed > 7 days
|
||||
# logs/deploy > 30 days
|
||||
# events/** > 3 days AND past observer checkpoint
|
||||
#
|
||||
# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
|
||||
# actions/pending|approved|running, Frigate recordings, Ollama models,
|
||||
# Zigbee2MQTT data, Mosquitto data, HA database/config.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Configuration
|
||||
# ---------------------------------------------------------------------------
|
||||
RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
|
||||
EVENTS_DIR="${RUNTIME_PATH}/events"
|
||||
STATE_DIR="${RUNTIME_PATH}/state"
|
||||
LOGS_DIR="${RUNTIME_PATH}/logs"
|
||||
ACTIONS_DIR="${RUNTIME_PATH}/actions"
|
||||
|
||||
NODE_NAME="${NODE_NAME:-$(hostname)}"
|
||||
TIMESTAMP=$(date +%s)
|
||||
DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
|
||||
|
||||
# Thresholds
|
||||
DISK_WARN_PCT=75
|
||||
DISK_CRIT_PCT=85
|
||||
MEM_WARN_PCT=85
|
||||
MEM_CRIT_PCT=95
|
||||
|
||||
# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
|
||||
CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
|
||||
CLEANUP_INTERVAL=86400 # seconds
|
||||
|
||||
# Node classifications
|
||||
LTE_NODES="chelsty-infra chelsty-ha"
|
||||
SD_CARD_NODES="piha saturn"
|
||||
AI_NODES="solaria"
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
log() { echo "$(date -u +%H:%M:%S) [INFO] $*"; }
|
||||
warn() { echo "$(date -u +%H:%M:%S) [WARN] $*" >&2; }
|
||||
err() { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
|
||||
|
||||
contains() {
|
||||
local word="$1"; shift
|
||||
for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
|
||||
return 1
|
||||
}
|
||||
|
||||
get_node_type() {
|
||||
# shellcheck disable=SC2086
|
||||
if contains "$NODE_NAME" $LTE_NODES; then echo "lte_node"; return; fi
|
||||
if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card"; return; fi
|
||||
if contains "$NODE_NAME" $AI_NODES; then echo "ai_node"; return; fi
|
||||
echo "standard"
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Event emission
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
emit_event() {
|
||||
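# Note: "${5:-{}}" parses as "${5:-{}" plus a literal "}", which appends a stray brace to any supplied payload.
local type="$1" severity="$2" service="${3:-}" message="$4" payload="${5:-}"
[[ -n "$payload" ]] || payload='{}'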
|
||||
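# Include the service name so two events of the same type in one run do not overwrite each other.
local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"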
|
||||
local dir="${EVENTS_DIR}/${NODE_NAME}"
|
||||
mkdir -p "$dir"
|
||||
cat > "${dir}/${id}.json" <<EOF
|
||||
{
|
||||
"id": "${id}",
|
||||
"timestamp": ${TIMESTAMP},
|
||||
"date": "${DATE}",
|
||||
"type": "${type}",
|
||||
"severity": "${severity}",
|
||||
"node": "${NODE_NAME}",
|
||||
"service": "${service}",
|
||||
"message": "${message}",
|
||||
"payload": ${payload}
|
||||
}
|
||||
EOF
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Health checks
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
check_disk() {
|
||||
# Use /opt/homelab as the check target — it lives on the host filesystem
|
||||
# and this path is correct both when running natively and in a container
|
||||
# that mounts /opt/homelab from the host.
|
||||
local mount="${RUNTIME_PATH}"
|
||||
local usage_pct avail_mb total_mb
|
||||
# One POSIX-format df call feeds all three values: -P prevents wrapped columns, -k fixes the unit to KiB.
read -r usage_pct avail_mb total_mb < <(df -Pk "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); printf "%d %d %d\n", $5, $4/1024, $2/1024}') || return
|
||||
|
||||
if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
|
||||
warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
|
||||
emit_event "disk_pressure" "high" "" \
|
||||
"Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
|
||||
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
|
||||
elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
|
||||
warn "Disk elevated: ${usage_pct}% used"
|
||||
emit_event "disk_pressure" "medium" "" \
|
||||
"Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
|
||||
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
|
||||
fi
|
||||
echo "${usage_pct}"
|
||||
}
|
||||
|
||||
check_memory() {
|
||||
local total avail pct avail_mb
|
||||
total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
|
||||
avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
|
||||
pct=$(( (total - avail) * 100 / total ))
|
||||
avail_mb=$(( avail / 1024 ))
|
||||
|
||||
if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
|
||||
warn "Memory CRITICAL: ${pct}% used"
|
||||
emit_event "high_memory" "high" "" \
|
||||
"Memory usage critical: ${pct}% (${avail_mb} MB available)" \
|
||||
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
|
||||
elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
|
||||
warn "Memory elevated: ${pct}%"
|
||||
emit_event "high_memory" "medium" "" \
|
||||
"Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
|
||||
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
|
||||
fi
|
||||
echo "${pct}"
|
||||
}
|
||||
|
||||
check_cpu() {
|
||||
# Two-sample /proc/stat delta for accurate instantaneous CPU usage.
|
||||
local idle1 total1 idle2 total2 pct
|
||||
read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
|
||||
sleep 1
|
||||
read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
|
||||
|
||||
local d_idle=$(( idle2 - idle1 ))
|
||||
local d_total=$(( total2 - total1 ))
|
||||
pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))
|
||||
|
||||
if [[ "${pct}" -ge 90 ]]; then
|
||||
warn "CPU elevated: ${pct}%"
|
||||
emit_event "high_cpu" "medium" "" \
|
||||
"CPU usage elevated: ${pct}%" \
|
||||
"{\"usage_pct\": ${pct}}"
|
||||
fi
|
||||
echo "${pct}"
|
||||
}
|
||||
|
||||
check_containers() {
|
||||
command -v docker &>/dev/null || return
|
||||
|
||||
# Exited containers that belong to a Compose project and are therefore expected to be up.
# Note: the filter matches the Compose label only; it does not inspect the restart policy.
|
||||
local cname
|
||||
while IFS= read -r cname; do
|
||||
[[ -z "$cname" ]] && continue
|
||||
warn "Container exited (should be running): ${cname}"
|
||||
emit_event "containers_not_running" "high" "${cname}" \
|
||||
"Container '${cname}' has exited unexpectedly (restart=unless-stopped)" \
|
||||
"{\"container\": \"${cname}\"}"
|
||||
done < <(docker ps -a \
|
||||
--filter "status=exited" \
|
||||
--filter "label=com.docker.compose.project" \
|
||||
--format "{{.Names}}" 2>/dev/null || true)
|
||||
|
||||
# Containers that are running but their health check is failing
|
||||
while IFS= read -r cname; do
|
||||
[[ -z "$cname" ]] && continue
|
||||
warn "Container unhealthy: ${cname}"
|
||||
emit_event "healthcheck_failed" "high" "${cname}" \
|
||||
"Container '${cname}' is running but health check is failing" \
|
||||
"{\"container\": \"${cname}\"}"
|
||||
done < <(docker ps \
|
||||
--filter "health=unhealthy" \
|
||||
--format "{{.Names}}" 2>/dev/null || true)
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Safe Docker cleanup (per policy)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
_sd_card_rate_ok() {
|
||||
if [[ -f "${CLEANUP_LOCK}" ]]; then
|
||||
local last_ts elapsed
|
||||
last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
|
||||
elapsed=$(( TIMESTAMP - last_ts ))
|
||||
if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
|
||||
log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
|
||||
return 1
|
||||
fi
|
||||
fi
|
||||
return 0
|
||||
}
|
||||
|
||||
_mark_cleanup_done() {
|
||||
echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
|
||||
}
|
||||
|
||||
run_safe_cleanup() {
|
||||
command -v docker &>/dev/null || return
|
||||
local node_type
|
||||
node_type=$(get_node_type)
|
||||
|
||||
case "${node_type}" in
|
||||
lte_node)
|
||||
# NO cleanup on LTE nodes. Any docker operation risks triggering
|
||||
# a pull over a metered/intermittent connection.
|
||||
log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
|
||||
;;
|
||||
|
||||
sd_card)
|
||||
# Dangling images + stopped containers only.
|
||||
# Rate-limited to once per 24 hours to protect SD card write endurance.
|
||||
_sd_card_rate_ok || return
|
||||
log "Running rate-limited Docker cleanup (SD card node)"
|
||||
docker image prune -f >/dev/null 2>&1 || true
|
||||
docker container prune -f >/dev/null 2>&1 || true
|
||||
_mark_cleanup_done
|
||||
;;
|
||||
|
||||
ai_node)
|
||||
# Dangling images + stopped containers + build cache.
|
||||
# NEVER docker image prune -a (would remove Ollama runtime images,
|
||||
# requiring a multi-hour re-pull of model weights).
|
||||
log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
|
||||
docker image prune -f >/dev/null 2>&1 || true
|
||||
docker container prune -f >/dev/null 2>&1 || true
|
||||
docker builder prune -f >/dev/null 2>&1 || true
|
||||
;;
|
||||
|
||||
standard)
|
||||
# VPS and other standard nodes: full safe cleanup.
|
||||
log "Running standard Docker cleanup"
|
||||
docker image prune -f >/dev/null 2>&1 || true
|
||||
docker container prune -f >/dev/null 2>&1 || true
|
||||
docker builder prune -f >/dev/null 2>&1 || true
|
||||
;;
|
||||
esac
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# VPS-specific: control-plane filesystem rotation
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
cleanup_control_plane_fs() {
|
||||
log "Running control-plane filesystem rotation"
|
||||
|
||||
# Completed / failed actions older than 7 days
|
||||
for status in completed failed; do
|
||||
local dir="${ACTIONS_DIR}/${status}"
|
||||
[[ -d "${dir}" ]] || continue
|
||||
find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
|
||||
log "Cleaned ${status} actions older than 7 days" || true
|
||||
done
|
||||
|
||||
# Deploy logs older than 30 days
|
||||
local deploy_logs="${LOGS_DIR}/deploy"
|
||||
if [[ -d "${deploy_logs}" ]]; then
|
||||
find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
|
||||
log "Cleaned deploy logs older than 30 days" || true
|
||||
fi
|
||||
|
||||
# Event files older than 3 days AND already past the observer checkpoint.
|
||||
# The dual condition ensures we never delete an event the observer hasn't seen.
|
||||
local checkpoint="${STATE_DIR}/observer_checkpoint.json"
|
||||
if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
|
||||
local last_processed
|
||||
last_processed=$(python3 -c "
|
||||
import json, sys
try:
    d = json.load(open('${checkpoint}'))
    print(d.get('last_processed_file', ''))
except Exception:
    print('')
|
||||
" 2>/dev/null || echo "")
|
||||
|
||||
if [[ -n "${last_processed}" ]]; then
|
||||
find "${EVENTS_DIR}" -name "*.json" -mtime +3 | while IFS= read -r f; do
|
||||
# Only delete files that sort before the checkpoint path
|
||||
# (i.e., the observer has already processed them).
|
||||
if [[ "$f" < "${last_processed}" ]]; then
|
||||
rm -f "$f"
|
||||
log "Cleaned old event: $(basename "$f")"
|
||||
fi
|
||||
done
|
||||
else
|
||||
log "No observer checkpoint set; skipping event file cleanup"
|
||||
fi
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
|
||||
|
||||
log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
|
||||
|
||||
disk_pct=$(check_disk || echo 0)
|
||||
mem_pct=$(check_memory || echo 0)
|
||||
cpu_pct=$(check_cpu || echo 0)
|
||||
check_containers
|
||||
|
||||
run_safe_cleanup
|
||||
|
||||
# VPS: also rotate control-plane filesystem artefacts
|
||||
if [[ "${NODE_NAME}" == "vps" ]]; then
|
||||
cleanup_control_plane_fs
|
||||
fi
|
||||
|
||||
# Emit a node_health heartbeat so the observer can update node status
|
||||
# and the supervisor can see up-to-date resource metrics.
|
||||
emit_event "node_health" "info" "" \
|
||||
"Health check completed on ${NODE_NAME}" \
|
||||
"{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
|
||||
|
||||
log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
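The cleanup policy in the header comment of health-monitor.sh can be read as a lookup table plus one rate limit. A minimal Python sketch of that decision, using the node lists and the 24 h interval from the script; the names `POLICY`, `node_type` and `cleanup_plan` are illustrative, and the function only returns the commands rather than running them:

```python
from typing import List, Optional

LTE_NODES = {"chelsty-infra", "chelsty-ha"}
SD_CARD_NODES = {"piha", "saturn"}
AI_NODES = {"solaria"}
CLEANUP_INTERVAL = 86400  # seconds; applies to SD-card nodes only

BASE = ["docker image prune -f", "docker container prune -f"]
POLICY = {
    "lte_node": [],                                  # no cleanup at all
    "sd_card": BASE,                                 # rate-limited
    "ai_node": BASE + ["docker builder prune -f"],   # never 'image prune -a'
    "standard": BASE + ["docker builder prune -f"],
}


def node_type(name: str) -> str:
    """Classify a node; anything not listed is treated as 'standard'."""
    if name in LTE_NODES:
        return "lte_node"
    if name in SD_CARD_NODES:
        return "sd_card"
    if name in AI_NODES:
        return "ai_node"
    return "standard"


def cleanup_plan(name: str, last_cleanup_ts: Optional[int], now: int) -> List[str]:
    """Return the prune commands allowed for this node at time 'now'."""
    kind = node_type(name)
    if (
        kind == "sd_card"
        and last_cleanup_ts is not None
        and now - last_cleanup_ts < CLEANUP_INTERVAL
    ):
        return []  # last cleanup was less than 24 h ago
    return list(POLICY[kind])


print(cleanup_plan("piha", last_cleanup_ts=None, now=1_000_000))
```

Expressing the policy as data makes the "never touched" guarantees easy to audit: no entry contains `-a` or a volume prune.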
|
||||
520
scripts/observer/observer.py
Normal file
|
|
@ -0,0 +1,520 @@
|
|||
import os
|
||||
import json
|
||||
import time
|
||||
import glob
|
||||
import logging
|
||||
import yaml
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
|
||||
|
||||
|
||||
def _parse_ts(ts) -> float:
    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.

    Events from node-agent use int(time.time()); events from stability-agent / events.py
    use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
    last_occurrence and resolved_at, so any arithmetic on them must go through here.
    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
    """
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except Exception:
        return 0.0
|
||||
|
||||
# Constants and Paths
|
||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
|
||||
STATE_DIR = Path(RUNTIME_PATH) / "state"
|
||||
LOGS_DIR = Path(RUNTIME_PATH) / "logs"
|
||||
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
||||
OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
|
||||
|
||||
REPO_ROOT = Path(__file__).parent.parent.parent
|
||||
INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
|
||||
|
||||
# Logging setup
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger("observer")
|
||||
|
||||
class Observer:
    def __init__(self):
        # Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
        # Replaces the old single last_processed_file which silently skipped event dirs
        # that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
        self.node_checkpoints: dict = {}
        self.world_state = {
            "nodes": {},
            "services": {},
            "deployments": {},
            "incidents": {},
            "summary": {
                "last_update": datetime.now(timezone.utc).isoformat(),
                "status": "initializing",
                "active_incidents_count": 0
            }
        }
        self.inventory = self._load_inventory()
        self._ensure_dirs()
        self._load_checkpoint()

    def _ensure_dirs(self):
        WORLD_DIR.mkdir(parents=True, exist_ok=True)
        STATE_DIR.mkdir(parents=True, exist_ok=True)
        EVENTS_DIR.mkdir(parents=True, exist_ok=True)
        LOGS_DIR.mkdir(parents=True, exist_ok=True)

    def _load_inventory(self):
        inventory = {"nodes": {}, "services": {}}
        try:
            if INVENTORY_TOPOLOGY.exists():
                with open(INVENTORY_TOPOLOGY, "r") as f:
                    topo = yaml.safe_load(f)
                for node_name, node_info in topo.get("nodes", {}).items():
                    inventory["nodes"][node_name] = {
                        "roles": node_info.get("roles", []),
                        "connectivity": node_info.get("connectivity", {})
                    }

            # Load service assignments from hosts files
            hosts_dir = REPO_ROOT / "hosts"
            for host_dir in hosts_dir.iterdir():
                if host_dir.is_dir():
                    svc_file = host_dir / "services.yaml"
                    if svc_file.exists():
                        with open(svc_file, "r") as f:
                            svc_data = yaml.safe_load(f)
                        host_name = svc_data.get("host")
                        for svc_name, svc_info in svc_data.get("services", {}).items():
                            if host_name not in inventory["services"]:
                                inventory["services"][host_name] = {}
                            inventory["services"][host_name][svc_name] = {
                                "role": svc_info.get("role"),
                                "exposure": svc_info.get("exposure")
                            }
        except Exception as e:
            logger.error(f"Failed to load inventory: {e}")
|
||||
return inventory
|
||||
|
||||
def _load_checkpoint(self):
|
||||
if OBSERVER_STATE_FILE.exists():
|
||||
try:
|
||||
with open(OBSERVER_STATE_FILE, "r") as f:
|
||||
checkpoint = json.load(f)
|
||||
|
||||
if "node_checkpoints" in checkpoint:
|
||||
# New format: per-directory checkpoints.
|
||||
self.node_checkpoints = checkpoint["node_checkpoints"]
|
||||
elif "last_processed_file" in checkpoint:
|
||||
# Migrate old single-file checkpoint: extract node dir from path.
|
||||
old = checkpoint["last_processed_file"]
|
||||
if old:
|
||||
try:
|
||||
node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
|
||||
self.node_checkpoints = {node_dir: old}
|
||||
logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
|
||||
except Exception:
|
||||
pass # Bad path — start fresh
|
||||
|
||||
self._load_world_from_disk()
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load checkpoint: {e}")
|
||||
|
||||
def _load_world_from_disk(self):
|
||||
# Optional: Load existing state to resume faster
|
||||
files = {
|
||||
"nodes": WORLD_DIR / "nodes.json",
|
||||
"services": WORLD_DIR / "services.json",
|
||||
"deployments": WORLD_DIR / "deployments.json",
|
||||
"incidents": WORLD_DIR / "incidents.json",
|
||||
"summary": WORLD_DIR / "runtime-summary.json"
|
||||
}
|
||||
for key, path in files.items():
|
||||
if path.exists():
|
||||
try:
|
||||
with open(path, "r") as f:
|
||||
self.world_state[key] = json.load(f)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load {key} state: {e}")
|
||||
|
||||
def _save_checkpoint(self):
|
||||
try:
|
||||
_atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to save checkpoint: {e}")
|
||||
|
||||
def _prune_stale_world(self):
|
||||
"""Remove world-state entries for nodes absent from the topology inventory.
|
||||
|
||||
Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
|
||||
falls back to socket.gethostname(), which inside a Docker container returns the
|
||||
12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
|
||||
('vps'). The observer ingests those events and creates ghost entries that never
|
||||
expire on their own.
|
||||
|
||||
Also ages out resolved incidents older than 7 days to keep world state lean.
|
||||
"""
|
||||
known_nodes = set(self.inventory["nodes"].keys())
|
||||
if not known_nodes:
|
||||
# Inventory failed to load — don't prune to avoid wiping valid state.
|
||||
return
|
||||
|
||||
stale_nodes = [n for n in list(self.world_state["nodes"].keys())
|
||||
if n not in known_nodes]
|
||||
for n in stale_nodes:
|
||||
logger.info(f"Pruning stale node from world state: {n}")
|
||||
del self.world_state["nodes"][n]
|
||||
|
||||
stale_svcs = [k for k in list(self.world_state["services"].keys())
|
||||
if k.split("/")[0] in stale_nodes]
|
||||
for k in stale_svcs:
|
||||
logger.info(f"Pruning stale service from world state: {k}")
|
||||
del self.world_state["services"][k]
|
||||
|
||||
# Prune ghost service keys whose service-name portion is a hash-prefixed
|
||||
# Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
|
||||
# These are created when node-agent incorrectly uses c.name instead of the
|
||||
# compose label, and accumulate on every container rebuild.
|
||||
# Pattern: <node>/<12hexchars>_<real-name>
|
||||
ghost_svcs = [
|
||||
k for k in list(self.world_state["services"].keys())
|
||||
if len(k.split("/", 1)) == 2
|
||||
and len(k.split("/", 1)[1]) > 13
|
||||
and k.split("/", 1)[1][12] == "_"
|
||||
and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
|
||||
]
|
||||
for k in ghost_svcs:
|
||||
logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
|
||||
del self.world_state["services"][k]
|
||||
|
||||
now = time.time()
|
||||
|
||||
try:
|
||||
# Collect incident_ids currently referenced by any service entry.
|
||||
linked_ids: set = {
|
||||
svc.get("incident_id")
|
||||
for svc in self.world_state["services"].values()
|
||||
if svc.get("incident_id")
|
||||
}
|
||||
|
||||
# Case 1 — service is healthy but still points at an active incident.
|
||||
# process_event already calls _resolve_incident on service_healthy events,
|
||||
# but if the observer restarted with on-disk state where the link was
|
||||
# intact (inconsistency from a pre-atomic-write crash), it may not get
|
||||
# resolved until the next service_healthy event is processed. Resolve
|
||||
# immediately — a healthy service cannot have an ongoing incident.
|
||||
for svc_key, svc in self.world_state["services"].items():
|
||||
if svc.get("status") != "healthy":
|
||||
continue
|
||||
inc_id = svc.get("incident_id")
|
||||
if not inc_id:
|
||||
continue
|
||||
inc = self.world_state["incidents"].get(inc_id, {})
|
||||
if inc.get("status") == "active":
|
||||
logger.info(
|
||||
f"Auto-resolving incident {inc_id} for {svc_key}: "
|
||||
f"service is healthy"
|
||||
)
|
||||
inc["status"] = "resolved"
|
||||
inc["resolved_at"] = now
|
||||
svc["incident_id"] = None
|
||||
linked_ids.discard(inc_id)
|
||||
|
||||
# Case 2 — orphaned active incident: no service entry links to it and
|
||||
# last_occurrence is older than 5 minutes (guard against creation races).
|
||||
# These are the stale records left behind when on-disk state was
|
||||
# inconsistent: the service entry had incident_id cleared but incidents.json
|
||||
# still had the record as "active".
|
||||
for inc_id, inc in self.world_state["incidents"].items():
|
||||
if inc.get("status") != "active":
|
||||
continue
|
||||
if inc_id in linked_ids:
|
||||
continue
|
||||
age = now - _parse_ts(inc.get("last_occurrence"))
|
||||
if age > 300: # 5-minute guard
|
||||
logger.info(
|
||||
f"Auto-resolving orphaned incident {inc_id} "
|
||||
f"(service={inc.get('service')}, node={inc.get('node')}): "
|
||||
f"no service references it, age={int(age)}s"
|
||||
)
|
||||
inc["status"] = "resolved"
|
||||
inc["resolved_at"] = now
|
||||
|
||||
except Exception as exc:
|
||||
logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
|
||||
|
||||
# Remove resolved incidents older than 7 days.
|
||||
# Use _parse_ts so ISO-string resolved_at values are handled correctly.
|
||||
stale_incidents = [
|
||||
k for k, v in self.world_state["incidents"].items()
|
||||
if v.get("status") == "resolved"
|
||||
and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
|
||||
]
|
||||
for k in stale_incidents:
|
||||
del self.world_state["incidents"][k]
|
||||
|
||||
def _save_world(self):
|
||||
self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
|
||||
active_incidents = [
|
||||
k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
|
||||
]
|
||||
self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
|
||||
self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
|
||||
self.world_state["summary"]["service_count"] = len(self.world_state["services"])
|
||||
|
||||
if active_incidents:
|
||||
self.world_state["summary"]["status"] = "degraded"
|
||||
else:
|
||||
self.world_state["summary"]["status"] = "nominal"
|
||||
|
||||
files = {
|
||||
"nodes.json": self.world_state["nodes"],
|
||||
"services.json": self.world_state["services"],
|
||||
"deployments.json": self.world_state["deployments"],
|
||||
"incidents.json": self.world_state["incidents"],
|
||||
"recommendations.json": [],
|
||||
"runtime-summary.json": self.world_state["summary"]
|
||||
}
|
||||
for filename, data in files.items():
|
||||
try:
|
||||
_atomic_write_json(WORLD_DIR / filename, data)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to save {filename}: {e}")
|
||||
|
||||
def process_event(self, event):
|
||||
etype = event.get("type")
|
||||
node = event.get("node")
|
||||
service = event.get("service")
|
||||
severity = event.get("severity")
|
||||
timestamp = event.get("timestamp")
|
||||
cid = event.get("correlation_id")
|
||||
payload = event.get("payload", {})
|
||||
|
||||
# 1. Update Node State
|
||||
if node not in self.world_state["nodes"]:
|
||||
self.world_state["nodes"][node] = {
|
||||
"status": "unknown",
|
||||
"last_seen": None,
|
||||
"roles": self.inventory["nodes"].get(node, {}).get("roles", [])
|
||||
}
|
||||
self.world_state["nodes"][node]["last_seen"] = timestamp
|
||||
|
||||
if etype == "node_online":
|
||||
self.world_state["nodes"][node]["status"] = "online"
|
||||
elif etype == "node_offline":
|
||||
self.world_state["nodes"][node]["status"] = "offline"
|
||||
|
||||
elif etype == "node_health":
|
||||
# Regular heartbeat from node-agent; updates resource metrics.
|
||||
# Clears disk_pressure if disk is now healthy (< warn threshold).
|
||||
self.world_state["nodes"][node]["status"] = "online"
|
||||
self.world_state["nodes"][node].update({
|
||||
"disk_usage_pct": payload.get("disk_pct"),
|
||||
"mem_usage_pct": payload.get("mem_pct"),
|
||||
"cpu_usage_pct": payload.get("cpu_pct"),
|
||||
})
|
||||
if (payload.get("disk_pct") or 0) < 75:
|
||||
self.world_state["nodes"][node].pop("disk_pressure", None)
|
||||
|
||||
elif etype == "disk_pressure":
|
||||
# Emitted when disk usage crosses 75 % (medium) or 85 % (high).
|
||||
# The supervisor reads disk_pressure to generate disk_cleanup actions.
|
||||
self.world_state["nodes"][node]["disk_pressure"] = severity
|
||||
self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
|
||||
|
||||
elif etype == "high_memory":
|
||||
# Memory pressure observation; recorded on the node for correlation.
|
||||
# No automated action — operator decides if a container restart helps.
|
||||
self.world_state["nodes"][node]["memory_pressure"] = severity
|
||||
self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
|
||||
|
||||
elif etype == "high_cpu":
|
||||
# CPU pressure observation; recorded for visibility.
|
||||
self.world_state["nodes"][node]["cpu_pressure"] = severity
|
||||
self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")
|
||||
|
||||
# 2. Update Service State
|
||||
if service and service != "all":
|
||||
svc_key = f"{node}/{service}"
|
||||
if svc_key not in self.world_state["services"]:
|
||||
self.world_state["services"][svc_key] = {
|
||||
"node": node,
|
||||
"service": service,
|
||||
"status": "unknown",
|
||||
"last_check": None,
|
||||
"incident_id": None
|
||||
}
|
||||
self.world_state["services"][svc_key]["last_check"] = timestamp
|
||||
|
||||
if etype == "service_recovered":
|
||||
self.world_state["services"][svc_key]["status"] = "healthy"
|
||||
self._resolve_incident(svc_key, timestamp)
|
||||
elif etype == "service_healthy":
|
||||
# Positive confirmation from node-agent that a managed container
|
||||
# is running. This keeps services.json populated so the supervisor
|
||||
# can correctly detect drift (absent entry = never reported = unknown,
|
||||
# not the same as confirmed missing).
|
||||
# Also resolve any active incident — if a service that had been
|
||||
# unhealthy/crashing is now confirmed healthy, the incident is over.
|
||||
self.world_state["services"][svc_key]["status"] = "healthy"
|
||||
self._resolve_incident(svc_key, timestamp)
|
||||
elif etype in ["service_unhealthy", "healthcheck_failed"]:
|
||||
self.world_state["services"][svc_key]["status"] = "unhealthy"
|
||||
self._handle_incident(svc_key, event)
|
||||
|
||||
        # 3. Update Deployment State
        # Events without a "type" leave etype as None, so check it before calling startswith.
        if etype and etype.startswith("deployment_") and cid:
            if cid not in self.world_state["deployments"]:
                self.world_state["deployments"][cid] = {
                    "node": node,
                    "service": service,
                    "status": "unknown",
                    "started_at": None,
                    "finished_at": None,
                    "events": []
                }
            self.world_state["deployments"][cid]["events"].append({
                "type": etype,
                "timestamp": timestamp,
                "payload": payload
            })
            if etype == "deployment_started":
                self.world_state["deployments"][cid]["status"] = "in_progress"
                self.world_state["deployments"][cid]["started_at"] = timestamp
            elif etype == "deployment_completed":
                self.world_state["deployments"][cid]["status"] = "completed"
                self.world_state["deployments"][cid]["finished_at"] = timestamp
            elif etype == "deployment_failed":
                self.world_state["deployments"][cid]["status"] = "failed"
                self.world_state["deployments"][cid]["finished_at"] = timestamp
                # Deployment failure often creates an incident
                self._handle_deployment_failure(event)
|
||||
|
||||
def _handle_incident(self, svc_key, event):
|
||||
# Correlation: collapse repeated failures for the same service on the same node
|
||||
active_incident = self.world_state["services"][svc_key].get("incident_id")
|
||||
|
||||
if active_incident and active_incident in self.world_state["incidents"]:
|
||||
incident = self.world_state["incidents"][active_incident]
|
||||
if incident["status"] == "active":
|
||||
incident["last_occurrence"] = event["timestamp"]
|
||||
incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
|
||||
incident["events"].append(event["timestamp"])
|
||||
return
|
||||
|
||||
# Create new incident
|
||||
incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
|
||||
self.world_state["incidents"][incident_id] = {
|
||||
"id": incident_id,
|
||||
"node": event.get("node"),
|
||||
"service": event.get("service"),
|
||||
"status": "active",
|
||||
"severity": event.get("severity"),
|
||||
# trigger_type records the event type that opened this incident so that
|
||||
# the supervisor can choose the appropriate remediation action
|
||||
# (e.g. container_restart for containers_not_running / mqtt_unreachable
|
||||
# vs. a full redeploy for other causes).
|
||||
"trigger_type": event.get("type"),
|
||||
"started_at": event.get("timestamp"),
|
||||
"last_occurrence": event.get("timestamp"),
|
||||
"occurrence_count": 1,
|
||||
"events": [event["timestamp"]],
|
||||
"correlation_id": event.get("correlation_id")
|
||||
}
|
||||
self.world_state["services"][svc_key]["incident_id"] = incident_id
|
||||
|
||||
def _resolve_incident(self, svc_key, timestamp):
|
||||
incident_id = self.world_state["services"][svc_key].get("incident_id")
|
||||
if incident_id and incident_id in self.world_state["incidents"]:
|
||||
if self.world_state["incidents"][incident_id]["status"] == "active":
|
||||
self.world_state["incidents"][incident_id]["status"] = "resolved"
|
||||
self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
|
||||
self.world_state["services"][svc_key]["incident_id"] = None
|
||||
|
||||
    def _handle_deployment_failure(self, event):
        # Specific logic for deployment failures
        svc_key = f"{event.get('node')}/{event.get('service')}"
        if svc_key not in self.world_state["services"]:
            # process_event only creates service entries for a concrete service
            # (not None or "all"); without one, _handle_incident would raise KeyError.
            logger.warning(f"Deployment failure for {svc_key} has no service entry; skipping incident")
            return
        self._handle_incident(svc_key, event)

        # Link diagnostics if available in payload
        incident_id = self.world_state["services"][svc_key].get("incident_id")
        if incident_id and incident_id in self.world_state["incidents"]:
            payload = event.get("payload", {})
            if "diagnostics_file" in payload:
                self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
            elif "error" in payload:
                self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
|
||||
|
||||
def run_once(self):
|
||||
# Update heartbeat
|
||||
heartbeat_file = STATE_DIR / "observer.heartbeat"
|
||||
try:
|
||||
heartbeat_file.touch()
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||
|
||||
# Collect all event files grouped by node directory.
|
||||
# Per-node checkpoints are compared within each directory independently,
|
||||
# so late-arriving events from remote nodes (sorted earlier in the path)
|
||||
# are never skipped just because another node's checkpoint is further ahead.
|
||||
all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
|
||||
|
||||
new_files = []
|
||||
for file_path in all_files:
|
||||
try:
|
||||
node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
|
||||
except (IndexError, ValueError):
|
||||
node_dir = "__unknown__"
|
||||
last_for_node = self.node_checkpoints.get(node_dir, "")
|
||||
if file_path > last_for_node:
|
||||
new_files.append((node_dir, file_path))
|
||||
|
||||
if not new_files:
|
||||
# Even if no new events, prune stale entries and refresh summary freshness.
|
||||
self._prune_stale_world()
|
||||
self._save_world()
|
||||
return
|
||||
|
||||
logger.info(f"Processing {len(new_files)} new events across "
|
||||
f"{len({n for n, _ in new_files})} node(s)")
|
||||
for node_dir, file_path in new_files:
|
||||
try:
|
||||
with open(file_path, "r") as f:
|
||||
event = json.load(f)
|
||||
self.process_event(event)
|
||||
# Advance per-node checkpoint (only forward — no regression).
|
||||
if file_path > self.node_checkpoints.get(node_dir, ""):
|
||||
self.node_checkpoints[node_dir] = file_path
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing {file_path}: {e}")
|
||||
|
||||
self._save_checkpoint()
|
||||
self._prune_stale_world()
|
||||
self._save_world()
|
||||
|
||||
def loop(self, interval=5):
|
||||
logger.info("Starting observer loop")
|
||||
while True:
|
||||
self.run_once()
|
||||
time.sleep(interval)
|
||||
|
||||
if __name__ == "__main__":
|
||||
import sys
|
||||
observer = Observer()
|
||||
if "--run-once" in sys.argv:
|
||||
observer.run_once()
|
||||
else:
|
||||
observer.loop()
|
||||
83
scripts/observer/test_setup.sh
Normal file
|
|
@ -0,0 +1,83 @@
|
|||
#!/usr/bin/env bash
mkdir -p /tmp/homelab/events/2026-05-12/saturn
mkdir -p /tmp/homelab/state
mkdir -p /tmp/homelab/logs
mkdir -p /tmp/homelab/world

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
{
  "timestamp": "2026-05-12T12:00:00Z",
  "node": "saturn",
  "type": "node_online",
  "severity": "info",
  "source": "system",
  "service": "all",
  "correlation_id": "init",
  "payload": {}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
{
  "timestamp": "2026-05-12T12:05:00Z",
  "node": "saturn",
  "type": "service_unhealthy",
  "severity": "error",
  "source": "healthcheck",
  "service": "mosquitto",
  "correlation_id": "hc-1",
  "payload": {"error": "connection refused"}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
{
  "timestamp": "2026-05-12T12:06:00Z",
  "node": "saturn",
  "type": "service_unhealthy",
  "severity": "error",
  "source": "healthcheck",
  "service": "mosquitto",
  "correlation_id": "hc-2",
  "payload": {"error": "connection refused"}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
{
  "timestamp": "2026-05-12T12:10:00Z",
  "node": "saturn",
  "type": "service_recovered",
  "severity": "info",
  "source": "healthcheck",
  "service": "mosquitto",
  "correlation_id": "hc-3",
  "payload": {}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
{
  "timestamp": "2026-05-12T12:15:00Z",
  "node": "saturn",
  "type": "deployment_started",
  "severity": "info",
  "source": "deploy_agent",
  "service": "mosquitto",
  "correlation_id": "deploy-1",
  "payload": {"version": "2.0.18"}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
{
  "timestamp": "2026-05-12T12:16:00Z",
  "node": "saturn",
  "type": "deployment_failed",
  "severity": "error",
  "source": "deploy_agent",
  "service": "mosquitto",
  "correlation_id": "deploy-1",
  "payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
}
EOF
55
services/agent-system/README.md
Normal file
|
|
@ -0,0 +1,55 @@
|
|||
### Agent System
Central runtime materializer and Operator Control Plane UI.

#### Components
- **Redis**: Central state store (on PIHA).
- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
- **Web UI**: Exposes API endpoints and serves the Operator UI.
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.

#### Configuration
Environment variables should be set in `.env` (see `env.example`).
Key variables for the Telegram Bot:
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
- `CONTROL_PLANE_URL`: URL of the `agent-system-webui` (default: `http://webui:8080`).
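
To illustrate how the comma-separated `TELEGRAM_ALLOWED_USER_IDS` value might be consumed, here is a minimal Python sketch. The helper names `parse_allowed_user_ids` and `is_authorized` are hypothetical and are not taken from the bot's source; the real bot may parse the variable differently.

```python
import os


def parse_allowed_user_ids(raw: str) -> set[int]:
    """Parse a comma-separated string of Telegram user IDs into a set of ints.

    Blank entries and surrounding whitespace are ignored; a non-numeric entry
    raises ValueError so that a typo in .env fails loudly.
    """
    ids = set()
    for part in raw.split(","):
        part = part.strip()
        if part:
            ids.add(int(part))
    return ids


def is_authorized(user_id: int, allowed: set[int]) -> bool:
    """Return True only if user_id is on the allow-list (an empty list denies everyone)."""
    return user_id in allowed


if __name__ == "__main__":
    allowed = parse_allowed_user_ids(os.environ.get("TELEGRAM_ALLOWED_USER_IDS", ""))
    print(f"{len(allowed)} authorized user(s) configured")
```

Denying everyone when the variable is empty is a design choice of this sketch, picked because an unset allow-list should not silently open the bot to all users.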

#### Telegram Commands
- `/status`: Check bot and API connectivity.
- `/summary`: System health overview.
- `/nodes`: List homelab nodes and their status.
- `/services`: Summary of services across nodes.
- `/unhealthy`: List all unhealthy components.
- `/incidents`: View active incidents.
- `/actions`: Summary of operator actions.
- `/help`: List all commands.

#### Deployment (on PIHA)
```bash
cd services/agent-system
./deploy.sh
```

#### Deployment (on CHELSTY)
```bash
cd services/stability-agent
docker compose up -d --build
```

#### Verification
The `deploy.sh` script automatically verifies the local endpoints.
You can also manually check:
```bash
# Check runtime summary
curl http://localhost:18180/summary

# Check discovered nodes
curl http://localhost:18180/nodes

# Check discovered services
curl http://localhost:18180/services
```
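
The same three checks can be scripted. The sketch below is one possible approach, not part of the repository: `check_endpoints` and `fetch_json` are hypothetical names, and it assumes each endpoint returns a JSON body, which this README does not state explicitly.

```python
import json
import urllib.request
from typing import Callable

ENDPOINTS = ("summary", "nodes", "services")


def fetch_json(url: str, timeout: float = 5.0):
    """Fetch a URL and parse the body as JSON (raises on HTTP or parse errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())


def check_endpoints(base_url: str, fetch: Callable[[str], object] = fetch_json) -> dict:
    """Return {endpoint: True/False} depending on whether each fetch succeeded."""
    results = {}
    for ep in ENDPOINTS:
        try:
            fetch(f"{base_url}/{ep}")
            results[ep] = True
        except Exception:  # any network, HTTP, or JSON error counts as a failure
            results[ep] = False
    return results
```

Calling `check_endpoints("http://localhost:18180")` on the host returns a dictionary with one boolean per endpoint. The `fetch` parameter exists only so the function can be exercised without a running Web UI.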

#### Directory Structure
- `/opt/homelab/world`: Contains materialized JSON state.
- `/opt/homelab/state`: Contains operator configuration and local heartbeats.
52
services/agent-system/action-model.md
Normal file
|
|
@ -0,0 +1,52 @@
|
|||
### Action Approval Data Model

Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.

#### Statuses
- `pending`: Waiting for operator approval. AI agents create actions in this state.
- `approved`: Approved by operator, ready for execution.
- `rejected`: Rejected by operator, will not be executed.
- `running`: Currently being executed by an agent (e.g. `materializer`).
- `completed`: Successfully executed.
- `failed`: Execution failed.
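
The status descriptions above imply a small state machine. The Python sketch below encodes one reading of it; the `ALLOWED_TRANSITIONS` table and `can_transition` helper are assumptions inferred from those descriptions, not a documented part of the data model.

```python
# Allowed status transitions, inferred from the status descriptions.
# Assumption: rejected, completed, and failed are terminal states.
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected"},
    "approved": {"running"},
    "running": {"completed", "failed"},
    "rejected": set(),
    "completed": set(),
    "failed": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving an action from current to target is allowed."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

A component that moves action files could call `can_transition` first and refuse, for example, to run an action that was never approved.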

#### Human-in-the-Loop (HIL) Protocol
1. **Request**: Agent identifies a required change and writes a JSON file to `actions/pending/`.
2. **Notification**: System notifies the human operator.
3. **Audit**: Human reviews `details.reason` and `details.diff`.
4. **Authorization**: Human moves the file to `approved/`.
5. **Execution**: Agent monitors `approved/` and executes the task.

#### Schema
```json
{
  "action_id": "string",
  "service": "string",
  "node": "string",
  "type": "deploy_service | restart_service | rollback | scale",
  "risk": "nominal | guarded | critical",
  "status": "pending | approved | rejected | ...",
  "created_at": <unix_seconds>,
  "updated_at": <unix_seconds>,
  "details": {
    "image": "string",
    "reason": "string",
    "diff": "string"
  },
  "transition_history": [
    {
      "from": "string | null",
      "to": "string",
      "timestamp": <unix_seconds>,
      "by": "string (system | operator-tg-12345 | webui)"
    }
  ]
}
```
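
As a rough illustration of the schema, the following Python sketch builds a new action in the `pending` state. The function name `new_pending_action`, the use of a UUID for `action_id`, and the validation of `type` and `risk` are assumptions made for this example; the schema itself only requires the fields shown above.

```python
import time
import uuid

ACTION_TYPES = {"deploy_service", "restart_service", "rollback", "scale"}
RISK_LEVELS = {"nominal", "guarded", "critical"}


def new_pending_action(service: str, node: str, action_type: str, risk: str,
                       reason: str, image: str = "", diff: str = "",
                       created_by: str = "system") -> dict:
    """Build an action dict in the pending state that follows the schema."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type}")
    if risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {risk}")
    now = int(time.time())
    return {
        "action_id": uuid.uuid4().hex,  # assumption: any unique string is acceptable
        "service": service,
        "node": node,
        "type": action_type,
        "risk": risk,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "details": {"image": image, "reason": reason, "diff": diff},
        # The first history entry records creation, with no previous status.
        "transition_history": [
            {"from": None, "to": "pending", "timestamp": now, "by": created_by}
        ],
    }
```

Serializing the result with `json.dump` turns `None` into `null`, which matches the `"string | null"` type given for `from`.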

#### Workflow
1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
2. `telegram-bot` detects the file, sends a message to allowed users.
3. Operator clicks "Approve" or "Reject".
4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
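
The move-and-append step in the workflow above could be sketched as follows. This is an illustrative sketch, not the bot's actual code: `transition_action` is a hypothetical helper, and it assumes all status directories live on the same filesystem so that `os.replace` is atomic.

```python
import json
import os
import time
from pathlib import Path


def transition_action(actions_root: Path, action_id: str,
                      from_status: str, to_status: str, by: str) -> Path:
    """Move an action file between status directories, recording the transition.

    The updated JSON is written to a temporary file in the destination
    directory, renamed into place with os.replace, and only then is the
    source file removed.
    """
    src = actions_root / from_status / f"{action_id}.json"
    dst_dir = actions_root / to_status
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / f"{action_id}.json"

    action = json.loads(src.read_text())
    now = int(time.time())
    action["status"] = to_status
    action["updated_at"] = now
    action.setdefault("transition_history", []).append(
        {"from": from_status, "to": to_status, "timestamp": now, "by": by}
    )

    # The temporary name does not end in .json, so directory watchers that
    # glob for *.json never see a half-written file.
    tmp = dst_dir / f".{action_id}.json.tmp"
    with open(tmp, "w") as f:
        json.dump(action, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, dst)  # destination appears fully written or not at all
    src.unlink()          # remove the old copy last
    return dst
```

For a brief moment the action exists in both directories. A reader that finds the same `action_id` twice can treat the copy with the later `updated_at` as authoritative.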
|
||||
28
services/agent-system/deploy.sh
Executable file
|
|
@ -0,0 +1,28 @@
|
|||
#!/bin/bash
set -e

echo ">>> Validating docker-compose configuration..."
docker compose config

echo ">>> Building and starting Agent System services..."
docker compose up -d --build

echo ">>> Services status:"
docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
    echo ">>> Telegram bot status: DISABLED (token missing)"
else
    echo ">>> Telegram bot status: ENABLED"
fi

echo ">>> Verifying API endpoints..."
sleep 5 # Give it a moment to start

endpoints=("summary" "nodes" "services")
for ep in "${endpoints[@]}"; do
    echo "Checking /$ep..."
    curl -s -f http://localhost:18180/$ep > /dev/null && echo " OK" || echo " FAILED"
done

echo ">>> Deployment complete."
47
services/agent-system/docker-compose.yml
Normal file
|
|
@ -0,0 +1,47 @@
|
|||
services:
  redis:
    image: redis:7
    container_name: agent-system-redis
    ports:
      - "6379:6379"
    restart: unless-stopped

  webui:
    build: ./webui
    container_name: agent-system-webui
    ports:
      - "18180:8080"
    volumes:
      - /opt/homelab:/opt/homelab
    depends_on:
      - redis
    restart: unless-stopped

  runtime-materializer:
    build: ./runtime-materializer
    container_name: agent-system-runtime-materializer
    environment:
      REDIS_HOST: redis
      REDIS_PORT: "6379"
      HOMELAB_WORLD_ROOT: /opt/homelab/world
      WORLD_DIR: /opt/homelab/world
      MATERIALIZE_INTERVAL: "10"
    volumes:
      - /opt/homelab:/opt/homelab
    depends_on:
      - redis
    restart: unless-stopped

  telegram-bot:
    build: ./telegram-bot
    container_name: agent-system-telegram-bot
    environment:
      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
      TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
      CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
      ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
      OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
      ACTIONS_ROOT: /opt/homelab/actions
    volumes:
      - /opt/homelab:/opt/homelab
    restart: on-failure
19
services/agent-system/env.example
Normal file
|
|
@ -0,0 +1,19 @@
|
|||
# Telegram Bot Configuration
# Get token from @BotFather
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
# Comma-separated list of Telegram User IDs
TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
# Local control-plane API (default is internal compose address)
CONTROL_PLANE_URL=http://webui:8080
# Optional LLM fallback logic
ENABLE_LLM_FALLBACK=false
OPENCLAW_BASE_URL=http://openclaw.internal

# Runtime Materializer Configuration
REDIS_HOST=100.108.208.3
REDIS_PORT=6379

# Paths
HOMELAB_ROOT=/opt/homelab
ACTIONS_ROOT=/opt/homelab/actions
WORLD_DIR=/opt/homelab/world
16
services/agent-system/runtime-materializer/Dockerfile
Normal file
|
|
@ -0,0 +1,16 @@
|
|||
FROM python:3.11-slim

WORKDIR /app

# Install the redis Python client used by materializer.py
RUN pip install --no-cache-dir redis

COPY materializer.py .

# Ensure the world directory exists in the container (though it will likely be a volume)
RUN mkdir -p /opt/homelab/world

# Use unbuffered output to see logs in docker
ENV PYTHONUNBUFFERED=1

CMD ["python", "materializer.py"]
251
services/agent-system/runtime-materializer/materializer.py
Normal file
|
|
@ -0,0 +1,251 @@
|
|||
import redis
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
import argparse
|
||||
import urllib.request
|
||||
import urllib.error
|
||||
from datetime import datetime
|
||||
|
||||
# Configuration from environment variables
|
||||
REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
|
||||
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
|
||||
WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")
|
||||
|
||||
# When set, materialize from the control-plane HTTP API instead of Redis.
|
||||
# This is the authoritative source of truth: the observer writes clean world
|
||||
# state to the control-plane API, which the materializer mirrors locally so
|
||||
# the webui's /snapshot (and all other endpoints) reflect the same data.
|
||||
#
|
||||
# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
|
||||
CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")
|
||||
|
||||
|
||||
def get_redis_client():
|
||||
"""Returns a Redis client with decoding enabled."""
|
||||
return redis.Redis(
|
||||
host=REDIS_HOST,
|
||||
port=REDIS_PORT,
|
||||
decode_responses=True,
|
||||
socket_timeout=5
|
||||
)
|
||||
|
||||
def safe_json_loads(data, default=None):
|
||||
"""Safely loads JSON from a string."""
|
||||
if not data:
|
||||
return default
|
||||
try:
|
||||
if isinstance(data, (dict, list)):
|
||||
return data
|
||||
return json.loads(data)
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
return data
|
||||
|
||||
def normalize_health(health):
|
||||
"""Normalizes health values for the UI."""
|
||||
if not health:
|
||||
return "nominal"
|
||||
h = str(health).lower()
|
||||
if h in ["healthy", "ok", "running", "nominal"]:
|
||||
return "nominal"
|
||||
if h in ["degraded", "warning"]:
|
||||
return "degraded"
|
||||
return "error"
|
||||
|
||||
|
||||
def _fetch_json(url):
|
||||
"""Fetch JSON from a URL, returning parsed data or None on error."""
|
||||
try:
|
||||
with urllib.request.urlopen(url, timeout=10) as resp:
|
||||
return json.loads(resp.read())
|
||||
except Exception as e:
|
||||
print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
|
||||
def write_json(filename, data):
|
||||
path = os.path.join(WORLD_DIR, filename)
|
||||
with open(path, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
|
||||
|
||||
def materialize_from_api():
|
||||
"""Mirror world state from the control-plane API to local world files.
|
||||
|
||||
The control-plane observer on VPS is the single authoritative writer of
|
||||
world state. By fetching from its HTTP API we get the same clean, pruned
|
||||
data that the /summary endpoint serves — no stale Redis artefacts.
|
||||
|
||||
Returns True if all fetches succeeded and files were written, False otherwise.
|
||||
"""
|
||||
print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")
|
||||
|
||||
endpoints = {
|
||||
"nodes.json": f"{CONTROL_PLANE_URL}/nodes",
|
||||
"services.json": f"{CONTROL_PLANE_URL}/services",
|
||||
"incidents.json": f"{CONTROL_PLANE_URL}/incidents",
|
||||
"deployments.json": f"{CONTROL_PLANE_URL}/deployments",
|
||||
"recommendations.json":f"{CONTROL_PLANE_URL}/recommendations",
|
||||
"runtime-summary.json":f"{CONTROL_PLANE_URL}/summary",
|
||||
"events.json": f"{CONTROL_PLANE_URL}/events",
|
||||
}
|
||||
|
||||
fetched = {}
|
||||
for filename, url in endpoints.items():
|
||||
data = _fetch_json(url)
|
||||
if data is None:
|
||||
print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
|
||||
return False
|
||||
fetched[filename] = data
|
||||
|
||||
os.makedirs(WORLD_DIR, exist_ok=True)
|
||||
for filename, data in fetched.items():
|
||||
write_json(filename, data)
|
||||
|
||||
svc_count = len(fetched.get("services.json") or [])
|
||||
print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
|
||||
return True
|
||||
|
||||
|
||||
def materialize():
|
||||
"""Reads state from Redis and writes JSON files to the world directory."""
|
||||
print(f"[{datetime.now().isoformat()}] Materializing world state...")
|
||||
try:
|
||||
r = get_redis_client()
|
||||
|
||||
# 1. Nodes
|
||||
nodes = []
|
||||
node_keys = r.keys("homelab:nodes:*")
|
||||
for key in node_keys:
|
||||
node_data = r.hgetall(key)
|
||||
if node_data:
|
||||
# Normalize health
|
||||
if "health" in node_data:
|
||||
node_data["health"] = normalize_health(node_data["health"])
|
||||
# Parse JSON fields if they exist
|
||||
if "capabilities" in node_data:
|
||||
node_data["capabilities"] = safe_json_loads(node_data["capabilities"], [])
|
||||
if "checks" in node_data:
|
||||
node_data["checks"] = safe_json_loads(node_data["checks"], {})
|
||||
nodes.append(node_data)
|
||||
|
||||
# 2. Services
|
||||
services = []
|
||||
service_keys = r.keys("homelab:services:*")
|
||||
for key in service_keys:
|
||||
svc_data = r.hgetall(key)
|
||||
if svc_data:
|
||||
# Normalize health
|
||||
if "health" in svc_data:
|
||||
svc_data["health"] = normalize_health(svc_data["health"])
|
||||
if "dependencies" in svc_data:
|
||||
svc_data["dependencies"] = safe_json_loads(svc_data["dependencies"], [])
|
||||
if "recommendations" in svc_data:
|
||||
svc_data["recommendations"] = safe_json_loads(svc_data["recommendations"], [])
|
||||
services.append(svc_data)
|
||||
|
||||
# 3. Events (Stream)
|
||||
events = []
|
||||
try:
|
||||
# Get last 100 events from the stream
|
||||
raw_events = r.xrevrange("homelab:events", count=100)
|
||||
for event_id, data in raw_events:
|
||||
event = data.copy()
|
||||
event["id"] = event_id
|
||||
if "details" in event:
|
||||
event["details"] = safe_json_loads(event["details"], {})
|
||||
events.append(event)
|
||||
except redis.exceptions.ResponseError:
|
||||
# homelab:events might not be a stream or doesn't exist
|
||||
pass
|
||||
|
||||
# 4. Incidents (Hash)
|
||||
incidents = []
|
||||
incident_keys = r.keys("homelab:incidents:*")
|
||||
for key in incident_keys:
|
||||
incident_data = r.hgetall(key)
|
||||
if incident_data:
|
||||
# Normalize health if present
|
||||
if "health" in incident_data:
|
||||
incident_data["health"] = normalize_health(incident_data["health"])
|
||||
incidents.append(incident_data)
|
||||
|
||||
# 5. Deployments (Hash)
|
||||
deployments = []
|
||||
deployment_keys = r.keys("homelab:deployments:*")
|
||||
for key in deployment_keys:
|
||||
dep_data = r.hgetall(key)
|
||||
if dep_data:
|
||||
deployments.append(dep_data)
|
||||
|
||||
# 6. Recommendations (Hash)
|
||||
recommendations = []
|
||||
recommendation_keys = r.keys("homelab:recommendations:*")
|
||||
for key in recommendation_keys:
|
||||
rec_data = r.hgetall(key)
|
||||
if rec_data:
|
||||
recommendations.append(rec_data)
|
||||
|
||||
# 7. Runtime Summary
|
||||
unhealthy_services = [s for s in services if s.get("health") != "nominal"]
|
||||
active_incidents = [i for i in incidents if i.get("status") not in ["resolved", "closed"]]
|
||||
|
||||
status = "nominal"
|
||||
if len(active_incidents) > 0 or len(unhealthy_services) > 5:
|
||||
status = "error"
|
||||
elif len(unhealthy_services) > 0:
|
||||
status = "degraded"
|
||||
|
||||
summary = {
|
||||
"status": status,
|
||||
"timestamp": datetime.utcnow().isoformat() + "Z",
|
||||
"last_update": int(time.time()),
|
||||
"node_count": len(nodes),
|
||||
"service_count": len(services),
|
||||
"active_incidents_count": len(active_incidents),
|
||||
"unhealthy_services_count": len(unhealthy_services),
|
||||
"incident_count": len(incidents),
|
||||
"recent_events_count": len(events),
|
||||
"stale": False
|
||||
}
|
||||
|
||||
# Ensure directory exists
|
||||
os.makedirs(WORLD_DIR, exist_ok=True)
|
||||
|
||||
write_json("runtime-summary.json", summary)
|
||||
write_json("nodes.json", nodes)
|
||||
write_json("services.json", services)
|
||||
write_json("incidents.json", incidents)
|
||||
write_json("events.json", events)
|
||||
write_json("deployments.json", deployments)
|
||||
write_json("recommendations.json", recommendations)
|
||||
|
||||
print(f"[{datetime.now().isoformat()}] Successfully materialized to {WORLD_DIR}")
|
||||
|
||||
except redis.exceptions.ConnectionError as e:
|
||||
print(f"Redis connection error: {e}")
|
||||
except Exception as e:
|
||||
print(f"Unexpected error during materialization: {e}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Homelab Runtime Materializer")
|
||||
parser.add_argument("--once", action="store_true", help="Run once and exit")
|
||||
parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
|
||||
args = parser.parse_args()
|
||||
|
||||
if CONTROL_PLANE_URL:
|
||||
print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
|
||||
run_fn = materialize_from_api
|
||||
else:
|
||||
print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
|
||||
run_fn = materialize
|
||||
|
||||
interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))
|
||||
|
||||
if args.once:
|
||||
run_fn()
|
||||
else:
|
||||
print(f"Starting materializer loop (interval: {interval}s)...")
|
||||
while True:
|
||||
run_fn()
|
||||
time.sleep(interval)
|
||||
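Note on `write_json` in materializer.py: it writes directly to the final path, so a reader of the world directory could in principle observe a half-written file. One possible hardening, sketched below under the assumption that readers and the writer share a single filesystem, is to write to a temporary file and rename it into place. The name `write_json_atomic` and its `directory` parameter are hypothetical and not part of this change set.

```python
import json
import os
import tempfile


def write_json_atomic(directory: str, filename: str, data) -> str:
    """Write data as JSON so readers never see a partially written file."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=filename + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        written = write_json_atomic(demo_dir, "nodes.json", [{"hostname": "example"}])
        with open(written) as f:
            print(json.load(f))
```

`tempfile.mkstemp` creates the file readable only by its owner, so if the reader runs as a different user an explicit `os.chmod` before the rename may be needed.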
39
services/agent-system/scripts/create-test-action.sh
Executable file
|
|
@@ -0,0 +1,39 @@
|
|||
#!/bin/bash
|
||||
# Script to create a test pending action for Telegram bot verification.
|
||||
|
||||
ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
|
||||
mkdir -p "$ACTIONS_PENDING_DIR"
|
||||
|
||||
ACTION_ID="test-$(date +%s)"
|
||||
FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"
|
||||
|
||||
TIMESTAMP=$(date +%s)
|
||||
|
||||
cat <<EOF > "$FILE_PATH"
|
||||
{
|
||||
"action_id": "$ACTION_ID",
|
||||
"service": "frigate",
|
||||
"node": "chelsty",
|
||||
"type": "deploy_service",
|
||||
"risk": "guarded",
|
||||
"status": "pending",
|
||||
"created_at": $TIMESTAMP,
|
||||
"updated_at": $TIMESTAMP,
|
||||
"details": {
|
||||
"image": "blakeblackshear/frigate:0.13.0",
|
||||
"reason": "Security update for Frigate",
|
||||
"diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
|
||||
},
|
||||
"transition_history": [
|
||||
{
|
||||
"from": null,
|
||||
"to": "pending",
|
||||
"timestamp": $TIMESTAMP,
|
||||
"by": "system-test"
|
||||
}
|
||||
]
|
||||
}
|
||||
EOF
|
||||
|
||||
echo "Test action created: $FILE_PATH"
|
||||
echo "If the telegram-bot is running and configured, you should receive a notification."
|
||||
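Note on create-test-action.sh: the same payload can be built in Python, which may be convenient when a test needs the action as a dict rather than a file on disk. This is only a sketch; `build_test_action` is a hypothetical helper, and the field values are copied from the heredoc in the script.

```python
import json
import time


def build_test_action(now=None) -> dict:
    """Return a pending test action shaped like the one the shell script writes."""
    ts = int(time.time()) if now is None else int(now)
    return {
        "action_id": f"test-{ts}",
        "service": "frigate",
        "node": "chelsty",
        "type": "deploy_service",
        "risk": "guarded",
        "status": "pending",
        "created_at": ts,
        "updated_at": ts,
        "details": {
            "image": "blakeblackshear/frigate:0.13.0",
            "reason": "Security update for Frigate",
            "diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0",
        },
        "transition_history": [
            {"from": None, "to": "pending", "timestamp": ts, "by": "system-test"}
        ],
    }


if __name__ == "__main__":
    # A fixed timestamp keeps the printed output reproducible.
    print(json.dumps(build_test_action(now=1700000000), indent=2))
```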
10
services/agent-system/telegram-bot/Dockerfile
Normal file
|
|
@@ -0,0 +1,10 @@
|
|||
FROM python:3.11-slim
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
COPY requirements.txt .
|
||||
RUN pip install --no-cache-dir -r requirements.txt
|
||||
|
||||
COPY bot.py .
|
||||
|
||||
CMD ["python", "bot.py"]
|
||||
454
services/agent-system/telegram-bot/bot.py
Normal file
|
|
@@ -0,0 +1,454 @@
|
|||
import os
|
||||
import json
|
||||
import time
|
||||
import asyncio
|
||||
import logging
|
||||
import urllib.request
|
||||
import urllib.error
|
||||
from pathlib import Path
|
||||
from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
|
||||
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
|
||||
|
||||
# Setup logging
|
||||
logging.basicConfig(
|
||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||
level=logging.INFO
|
||||
)
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Configuration
|
||||
TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
|
||||
ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
|
||||
ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
|
||||
CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
|
||||
ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
|
||||
OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
|
||||
|
||||
async def fetch_api(path):
|
||||
"""Helper to fetch JSON from the Control Plane API."""
|
||||
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
|
||||
try:
|
||||
def do_request():
|
||||
req = urllib.request.Request(url)
|
||||
with urllib.request.urlopen(req, timeout=5) as response:
|
||||
if response.status != 200:
|
||||
return None
|
||||
return json.loads(response.read().decode())
|
||||
return await asyncio.to_thread(do_request)
|
||||
except Exception as e:
|
||||
logger.error(f"Error fetching {url}: {e}")
|
||||
return None
|
||||
|
||||
async def post_api(path, data):
|
||||
"""Helper to POST JSON to the Control Plane API."""
|
||||
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
|
||||
try:
|
||||
body = json.dumps(data).encode("utf-8")
|
||||
def do_request():
|
||||
req = urllib.request.Request(url, data=body, method="POST")
|
||||
req.add_header("Content-Type", "application/json")
|
||||
with urllib.request.urlopen(req, timeout=5) as response:
|
||||
return response.status == 200
|
||||
return await asyncio.to_thread(do_request)
|
||||
except Exception as e:
|
||||
logger.error(f"Error posting to {url}: {e}")
|
||||
return False
|
||||
|
||||
def _format_pending_action(action_id: str, data: dict) -> str:
|
||||
"""Build the Telegram Markdown message for a pending action notification.
|
||||
|
||||
Extracted so it can be unit-tested without a live Telegram connection.
|
||||
"""
|
||||
# Supervisor writes risk_level; action-model.md legacy schema used risk.
|
||||
risk = data.get("risk_level") or data.get("risk", "unknown")
|
||||
message = (
|
||||
f"⚠️ *Pending Action*\n"
|
||||
f"ID: `{action_id}`\n"
|
||||
f"Type: `{data.get('type', 'unknown')}`\n"
|
||||
f"Service: `{data.get('service', 'unknown')}`\n"
|
||||
f"Node: `{data.get('node', 'unknown')}`\n"
|
||||
f"Risk: *{risk}*\n"
|
||||
)
|
||||
# description carries the human-readable substance of the action (required for
|
||||
# alert_only actions where it is the entire operator-visible message).
|
||||
description = data.get("description", "")
|
||||
if description:
|
||||
truncated = description[:300] + ("..." if len(description) > 300 else "")
|
||||
message += f"Description: `{truncated}`\n"
|
||||
# Legacy details block (old action-model.md schema) — kept for backwards compat.
|
||||
if "details" in data:
|
||||
details_str = json.dumps(data["details"], indent=2)
|
||||
if len(details_str) > 1000:
|
||||
details_str = details_str[:1000] + "..."
|
||||
message += f"\nDetails:\n```json\n{details_str}\n```"
|
||||
return message
|
||||
|
||||
|
||||
class ApprovalBot:
|
||||
def __init__(self):
|
||||
self.pending_dir = ACTIONS_ROOT / "pending"
|
||||
self.approved_dir = ACTIONS_ROOT / "approved"
|
||||
self.rejected_dir = ACTIONS_ROOT / "rejected"
|
||||
# Track which action IDs we have already notified in this session to avoid spam
|
||||
self.notified_actions = set()
|
||||
|
||||
async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Job that periodically checks for new pending action files."""
|
||||
if not self.pending_dir.exists():
|
||||
return
|
||||
|
||||
try:
|
||||
for action_file in self.pending_dir.glob("*.json"):
|
||||
action_id = action_file.stem
|
||||
if action_id in self.notified_actions:
|
||||
continue
|
||||
|
||||
try:
|
||||
data = json.loads(action_file.read_text())
|
||||
# Only notify if it's truly pending
|
||||
if data.get("status") == "pending":
|
||||
await self.notify_users(context, action_id, data)
|
||||
self.notified_actions.add(action_id)
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing action file {action_file}: {e}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error scanning pending directory: {e}")
|
||||
|
||||
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
|
||||
"""Sends an approval request message to all allowed users."""
|
||||
message = _format_pending_action(action_id, data)
|
||||
|
||||
keyboard = [
|
||||
[
|
||||
InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
|
||||
InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
|
||||
]
|
||||
]
|
||||
reply_markup = InlineKeyboardMarkup(keyboard)
|
||||
|
||||
for user_id in ALLOWED_IDS:
|
||||
try:
|
||||
await context.bot.send_message(
|
||||
chat_id=user_id,
|
||||
text=message,
|
||||
parse_mode="Markdown",
|
||||
reply_markup=reply_markup
|
||||
)
|
||||
logger.info(f"Notified user {user_id} about action {action_id}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to notify user {user_id}: {e}")
|
||||
|
||||
async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Handles button clicks for Approve/Reject."""
|
||||
query = update.callback_query
|
||||
user_id = query.from_user.id
|
||||
|
||||
if user_id not in ALLOWED_IDS:
|
||||
await query.answer("Unauthorized", show_alert=True)
|
||||
return
|
||||
|
||||
await query.answer()
|
||||
|
||||
cb_data = query.data
|
||||
        if not cb_data or not cb_data.startswith(("approve:", "reject:")):
|
||||
return
|
||||
|
||||
action, action_id = cb_data.split(":", 1)
|
||||
target_status = "approved" if action == "approve" else "rejected"
|
||||
|
||||
        # Prefer the control-plane API for the mutation; fall back to a local disk move if the call fails
|
||||
success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
|
||||
msg = "Success" if success else "API call failed"
|
||||
|
||||
if not success:
|
||||
# Fallback to direct disk manipulation (original behavior)
|
||||
success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))
|
||||
|
||||
if success:
|
||||
status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
|
||||
await query.edit_message_text(
|
||||
                text=query.message.text_markdown + f"\n\n{status_text} by {query.from_user.first_name}",
|
||||
parse_mode="Markdown"
|
||||
)
|
||||
# Remove from notified list as it's no longer pending
|
||||
if action_id in self.notified_actions:
|
||||
self.notified_actions.remove(action_id)
|
||||
else:
|
||||
await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
|
||||
|
||||
def move_action(self, action_id, target_status, user_id, username):
|
||||
"""Moves action file and updates its status and history."""
|
||||
source_path = self.pending_dir / f"{action_id}.json"
|
||||
if not source_path.exists():
|
||||
return False, "Action file no longer exists in pending."
|
||||
|
||||
target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
|
||||
target_dir.mkdir(parents=True, exist_ok=True)
|
||||
target_path = target_dir / f"{action_id}.json"
|
||||
|
||||
try:
|
||||
data = json.loads(source_path.read_text())
|
||||
current_status = data.get("status", "pending")
|
||||
|
||||
# Update data
|
||||
data["status"] = target_status
|
||||
data["updated_at"] = time.time()
|
||||
|
||||
history = data.get("transition_history", [])
|
||||
history.append({
|
||||
"from": current_status,
|
||||
"to": target_status,
|
||||
"timestamp": time.time(),
|
||||
"by": f"tg:{username}"
|
||||
})
|
||||
data["transition_history"] = history
|
||||
|
||||
            # Two-step move (not atomic): write the updated file to the target directory, then delete the source
|
||||
target_path.write_text(json.dumps(data, indent=2))
|
||||
source_path.unlink()
|
||||
logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
|
||||
return True, "Success"
|
||||
except Exception as e:
|
||||
logger.error(f"Error moving action file: {e}")
|
||||
return False, str(e)
|
||||
|
||||
async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Simple start command to help users find their ID."""
|
||||
user = update.effective_user
|
||||
message = (
|
||||
f"Hello {user.first_name}! 🤖\n"
|
||||
f"Your Telegram User ID is: `{user.id}`\n\n"
|
||||
)
|
||||
if user.id in ALLOWED_IDS:
|
||||
message += "✅ You are authorized to manage the homelab.\n\n"
|
||||
message += "Use /help to see available commands."
|
||||
else:
|
||||
message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."
|
||||
|
||||
await update.message.reply_text(message, parse_mode="Markdown")
|
||||
|
||||
async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
res = await fetch_api("/summary")
|
||||
status = "✅ Online" if res else "❌ Unreachable"
|
||||
message = (
|
||||
f"🤖 *Telegram Bot Status*\n"
|
||||
f"Control Plane API: {status}\n"
|
||||
f"Target URL: `{CONTROL_PLANE_URL}`\n"
|
||||
)
|
||||
await update.message.reply_text(message, parse_mode="Markdown")
|
||||
|
||||
async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
data = await fetch_api("/summary")
|
||||
if not data:
|
||||
await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
|
||||
return
|
||||
|
||||
msg = "📊 *System Summary*\n"
|
||||
msg += f"Status: `{data.get('status', 'unknown')}`\n"
|
||||
msg += f"Nodes: {data.get('node_count', 0)}\n"
|
||||
msg += f"Services: {data.get('service_count', 0)}\n"
|
||||
msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
|
||||
if data.get('stale'):
|
||||
msg += "\n⚠️ *Warning: Data is stale!*"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
nodes = await fetch_api("/nodes")
|
||||
if nodes is None:
|
||||
await update.message.reply_text("❌ Failed to fetch nodes.")
|
||||
return
|
||||
|
||||
if not nodes:
|
||||
await update.message.reply_text("No nodes discovered in the fleet.")
|
||||
return
|
||||
|
||||
msg = "🖥️ *Nodes Status*\n"
|
||||
for node in nodes:
|
||||
health_icon = "✅" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else "❌"
|
||||
msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
|
||||
msg += f" Last seen: {node.get('last_seen', 'N/A')}\n"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
services = await fetch_api("/services")
|
||||
if services is None:
|
||||
await update.message.reply_text("❌ Failed to fetch services.")
|
||||
return
|
||||
|
||||
# Summarize by node
|
||||
nodes = {}
|
||||
for s in services:
|
||||
node = s.get("node", "unknown")
|
||||
if node not in nodes: nodes[node] = []
|
||||
nodes[node].append(s)
|
||||
|
||||
msg = "⚙️ *Services Summary*\n"
|
||||
if not nodes:
|
||||
msg += "No services discovered."
|
||||
else:
|
||||
for node, svc_list in sorted(nodes.items()):
|
||||
nominal = len([s for s in svc_list if s.get("health") == "nominal"])
|
||||
msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"
|
||||
|
||||
msg += "\nUse /unhealthy to see issues."
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
services = await fetch_api("/services")
|
||||
nodes = await fetch_api("/nodes")
|
||||
|
||||
msg = "⚠️ *Unhealthy Components*\n"
|
||||
found = False
|
||||
|
||||
if services:
|
||||
for s in services:
|
||||
            health = (s.get("health") or "").lower()
|
||||
if health != "nominal":
|
||||
msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
|
||||
found = True
|
||||
|
||||
if nodes:
|
||||
for n in nodes:
|
||||
checks = n.get("checks", {})
|
||||
if isinstance(checks, str):
|
||||
try: checks = json.loads(checks)
|
||||
                except (json.JSONDecodeError, TypeError): checks = {}
|
||||
|
||||
docker = checks.get("docker", {})
|
||||
if docker.get("status") == "ok":
|
||||
for c in docker.get("containers", []):
|
||||
if c.get("state") != "running":
|
||||
msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
|
||||
found = True
|
||||
|
||||
if not found:
|
||||
msg += "All systems nominal. ✅"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
incidents = await fetch_api("/incidents")
|
||||
if incidents is None:
|
||||
await update.message.reply_text("❌ Failed to fetch incidents.")
|
||||
return
|
||||
|
||||
active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
|
||||
if not active:
|
||||
await update.message.reply_text("No active incidents. ✅")
|
||||
return
|
||||
|
||||
msg = "🚨 *Active Incidents*\n"
|
||||
for inc in active:
|
||||
severity = inc.get('severity', 'info').upper()
|
||||
msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
actions = await fetch_api("/actions")
|
||||
if actions is None:
|
||||
await update.message.reply_text("❌ Actions endpoint unavailable.")
|
||||
return
|
||||
|
||||
msg = "⚡ *Actions Summary*\n"
|
||||
total = 0
|
||||
for status, act_list in actions.items():
|
||||
if act_list:
|
||||
msg += f"• {status.capitalize()}: {len(act_list)}\n"
|
||||
total += len(act_list)
|
||||
|
||||
if total == 0:
|
||||
msg = "No actions recorded."
|
||||
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
msg = (
|
||||
"📖 *Supported Commands*\n\n"
|
||||
"/status - Check bot and API connectivity\n"
|
||||
"/summary - System health overview\n"
|
||||
"/nodes - List homelab nodes and their status\n"
|
||||
"/services - Summary of services across nodes\n"
|
||||
"/unhealthy - List all unhealthy components\n"
|
||||
"/incidents - View active incidents\n"
|
||||
"/actions - Summary of operator actions\n"
|
||||
"/help - Show this help message\n\n"
|
||||
"Free text will be handled by the guidance system."
|
||||
)
|
||||
await update.message.reply_text(msg, parse_mode="Markdown")
|
||||
|
||||
async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
|
||||
"""Handles non-command messages."""
|
||||
if update.effective_user.id not in ALLOWED_IDS: return
|
||||
|
||||
if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
|
||||
# Placeholder for OpenClaw LLM fallback
|
||||
# In a real scenario, this would call the LLM API
|
||||
logger.info(f"LLM fallback requested for: {update.message.text}")
|
||||
|
||||
await update.message.reply_text(
|
||||
"Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
|
||||
)
|
||||
|
||||
async def run_bot():
|
||||
if not TOKEN:
|
||||
        logger.critical("TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
|
||||
        # The token is optional: when it is absent the bot exits cleanly instead of failing,
|
||||
        # so the rest of the compose stack is unaffected.
|
||||
return
|
||||
|
||||
bot_logic = ApprovalBot()
|
||||
|
||||
application = ApplicationBuilder().token(TOKEN).build()
|
||||
|
||||
application.add_handler(CommandHandler("start", start_command))
|
||||
application.add_handler(CommandHandler("status", status_command))
|
||||
application.add_handler(CommandHandler("summary", summary_command))
|
||||
application.add_handler(CommandHandler("nodes", nodes_command))
|
||||
application.add_handler(CommandHandler("services", services_command))
|
||||
application.add_handler(CommandHandler("unhealthy", unhealthy_command))
|
||||
application.add_handler(CommandHandler("incidents", incidents_command))
|
||||
application.add_handler(CommandHandler("actions", actions_command))
|
||||
application.add_handler(CommandHandler("help", help_command))
|
||||
|
||||
application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
|
||||
application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))
|
||||
|
||||
# Schedule the pending actions check
|
||||
job_queue = application.job_queue
|
||||
if job_queue:
|
||||
job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
|
||||
else:
|
||||
logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")
|
||||
|
||||
logger.info("Starting Telegram Approval Bot...")
|
||||
await application.initialize()
|
||||
await application.start()
|
||||
await application.updater.start_polling()
|
||||
|
||||
# Run until the application is stopped
|
||||
stop_event = asyncio.Event()
|
||||
try:
|
||||
await stop_event.wait()
|
||||
except (KeyboardInterrupt, SystemExit):
|
||||
logger.info("Stopping bot...")
|
||||
finally:
|
||||
await application.stop()
|
||||
await application.shutdown()
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
asyncio.run(run_bot())
|
||||
except KeyboardInterrupt:
|
||||
pass
|
||||
except Exception as e:
|
||||
logger.error(f"Fatal error: {e}")
|
||||
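Note on `handle_callback` in bot.py: the callback payload has the form `approve:<action_id>` or `reject:<action_id>`. A stricter parser that rejects unknown verbs and empty IDs instead of defaulting to a rejection could be sketched as below. `parse_callback` and `_STATUS_BY_VERB` are hypothetical names, not part of this change set.

```python
from typing import Optional, Tuple

# Maps the verb carried in the button payload to the resulting action status.
_STATUS_BY_VERB = {"approve": "approved", "reject": "rejected"}


def parse_callback(cb_data: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (action_id, target_status), or None for malformed callback data."""
    if not cb_data or ":" not in cb_data:
        return None
    # Split on the first colon only, so action IDs may themselves contain colons.
    verb, action_id = cb_data.split(":", 1)
    status = _STATUS_BY_VERB.get(verb)
    if status is None or not action_id:
        return None
    return action_id, status


if __name__ == "__main__":
    print(parse_callback("approve:test-1700000000"))
    print(parse_callback("delete:test-1700000000"))
```

Telegram limits `callback_data` to 64 bytes, so long action IDs combined with the verb prefix may need a shorter lookup key.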
1
services/agent-system/telegram-bot/requirements.txt
Normal file
|
|
@@ -0,0 +1 @@
|
|||
python-telegram-bot[job-queue]==20.7
|
||||
38
services/agent-system/telegram-bot/tests/conftest.py
Normal file
|
|
@@ -0,0 +1,38 @@
|
|||
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
import types
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
|
||||
def _make_telegram_stub() -> types.ModuleType:
|
||||
mod = types.ModuleType("telegram")
|
||||
mod.Update = MagicMock
|
||||
mod.InlineKeyboardButton = MagicMock
|
||||
mod.InlineKeyboardMarkup = MagicMock
|
||||
return mod
|
||||
|
||||
|
||||
def _make_telegram_ext_stub() -> types.ModuleType:
|
||||
mod = types.ModuleType("telegram.ext")
|
||||
mod.ApplicationBuilder = MagicMock
|
||||
|
||||
# ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
|
||||
# evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
|
||||
ContextTypesMock = MagicMock()
|
||||
ContextTypesMock.DEFAULT_TYPE = type(None)
|
||||
mod.ContextTypes = ContextTypesMock
|
||||
|
||||
mod.CommandHandler = MagicMock
|
||||
mod.CallbackQueryHandler = MagicMock
|
||||
mod.MessageHandler = MagicMock
|
||||
mod.filters = MagicMock()
|
||||
return mod
|
||||
|
||||
|
||||
# Insert before any import of bot.py
|
||||
if "telegram" not in sys.modules:
|
||||
sys.modules["telegram"] = _make_telegram_stub()
|
||||
if "telegram.ext" not in sys.modules:
|
||||
sys.modules["telegram.ext"] = _make_telegram_ext_stub()
|
||||
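Note on conftest.py: the stubs work because Python's import system consults `sys.modules` before searching for a package, so a module object registered there satisfies a later `import`. The mechanism can be seen in isolation in the sketch below. `install_stub` is a hypothetical helper and `fake_sdk` is a made-up module name used only for this demonstration.

```python
import importlib
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a stand-in module under `name` unless one is already loaded."""
    if name in sys.modules:
        return sys.modules[name]
    mod = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(mod, key, value)
    sys.modules[name] = mod
    return mod


# "fake_sdk" does not exist on disk; the import below succeeds only because
# the stub was registered in sys.modules first.
install_stub("fake_sdk", Client=MagicMock)
fake_sdk = importlib.import_module("fake_sdk")
client = fake_sdk.Client()
print(type(client).__name__)
```

The `if name in sys.modules` guard mirrors the checks in conftest.py: when the real package is installed and already imported, it is left untouched.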
116
services/agent-system/telegram-bot/tests/test_format.py
Normal file
|
|
@@ -0,0 +1,116 @@
|
|||
"""Tests for _format_pending_action — no Telegram connection required.
|
||||
|
||||
telegram stubs are set up in conftest.py before this module is imported.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent.parent))
|
||||
from bot import _format_pending_action
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Bug 1 — risk_level field
# ---------------------------------------------------------------------------

def test_risk_level_shown_when_present():
    data = {
        "type": "container_restart", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "low",
    }
    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
    assert "Risk: *low*" in msg
    assert "unknown" not in msg


def test_risk_falls_back_to_legacy_risk_key():
    data = {
        "type": "redeploy", "service": "mosquitto",
        "node": "chelsty-infra", "risk": "guarded",
    }
    msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
    assert "Risk: *guarded*" in msg


def test_risk_unknown_when_both_absent():
    data = {"type": "redeploy", "service": "foo", "node": "bar"}
    msg = _format_pending_action("redeploy-bar-foo", data)
    assert "Risk: *unknown*" in msg


# ---------------------------------------------------------------------------
# Bug 2 — description field
# ---------------------------------------------------------------------------

def test_description_shown_for_alert_only():
    data = {
        "type": "alert_only", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "info",
        "description": "3 entities unavailable for >1h",
    }
    msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
    assert "3 entities unavailable for >1h" in msg
    assert "Description:" in msg


def test_description_shown_for_container_restart():
    data = {
        "type": "container_restart", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "low",
        "description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
    }
    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
    assert "HA WebSocket unresponsive" in msg


def test_description_absent_no_crash():
    data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
    msg = _format_pending_action("redeploy-bar-foo", data)
    assert "Description:" not in msg
    assert "Risk: *guarded*" in msg


def test_description_truncated_at_300_chars():
    long_desc = "x" * 400
    data = {
        "type": "alert_only", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "info",
        "description": long_desc,
    }
    msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
    assert "x" * 300 in msg
    assert "..." in msg
    assert "x" * 301 not in msg


# ---------------------------------------------------------------------------
# Combined — real HA alert_only action shape
# ---------------------------------------------------------------------------

def test_ha_alert_only_full_action():
    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
    data = {
        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
        "type": "alert_only",
        "node": "chelsty-ha",
        "service": "homeassistant",
        "risk_level": "info",
        "confidence": 1.0,
        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
        "status": "pending",
        "payload": {
            "location_tag": "chelsty",
            "reason": "ha_entity_unavailable_long",
            "count": 3,
        },
    }
    msg = _format_pending_action(data["action_id"], data)
    assert "alert_only" in msg
    assert "chelsty-ha" in msg
    assert "Risk: *info*" in msg
    assert "3 entities unavailable" in msg
    assert "unknown" not in msg
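Review note: `bot._format_pending_action` itself is not shown in this part of the diff. The sketch below is a hypothetical formatter that would satisfy the assertions in the tests above (the `risk_level` then `risk` then `unknown` fallback, the optional `Description:` line, and truncation at 300 characters). The function name `format_pending_action_sketch`, the constant name, and the exact line layout are assumptions, not the project's code.

```python
# Limit implied by test_description_truncated_at_300_chars; the name is an assumption.
MAX_DESCRIPTION_CHARS = 300


def format_pending_action_sketch(action_id: str, data: dict) -> str:
    """Hypothetical stand-in for bot._format_pending_action."""
    # Prefer the current key, fall back to the legacy one, then to a placeholder.
    risk = data.get("risk_level") or data.get("risk") or "unknown"
    lines = [
        f"Pending action: `{action_id}`",
        f"Type: {data.get('type', '?')}",
        f"Target: {data.get('service', '?')} on {data.get('node', '?')}",
        f"Risk: *{risk}*",
    ]
    description = data.get("description")
    if description:
        # Keep long descriptions from flooding the chat message.
        if len(description) > MAX_DESCRIPTION_CHARS:
            description = description[:MAX_DESCRIPTION_CHARS] + "..."
        lines.append(f"Description: {description}")
    return "\n".join(lines)


example = format_pending_action_sketch(
    "redeploy-chelsty-infra-mosquitto",
    {"type": "redeploy", "service": "mosquitto", "node": "chelsty-infra", "risk": "guarded"},
)
print(example)
```

Using `or` rather than a default argument to `dict.get` means an empty or `None` `risk_level` also falls through to the legacy key, which may or may not match the real function's behavior.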
7
services/agent-system/webui/Dockerfile
Normal file
|
|
@ -0,0 +1,7 @@
|
|||
FROM python:3.11-slim

WORKDIR /app
COPY web.py index.html ./

EXPOSE 8080
CMD ["python", "web.py"]
769
services/agent-system/webui/index.html
Normal file
|
|
@ -0,0 +1,769 @@
|
|||
<!doctype html>
|
||||
<html lang="en">
|
||||
<head>
|
||||
<meta charset="utf-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
||||
<title>Operator Control Plane</title>
|
||||
<style>
|
||||
:root {
|
||||
--bg-color: #0a0c0e;
|
||||
--sidebar-color: #14171a;
|
||||
--card-color: #1c2024;
|
||||
--border-color: #2a3540;
|
||||
--text-color: #e7edf3;
|
||||
--text-muted: #94a3b8;
|
||||
--accent-color: #3eaf7c;
|
||||
--nominal: #3eaf7c;
|
||||
--degraded: #e7c000;
|
||||
--unstable: #e67e22;
|
||||
--reconciling: #3498db;
|
||||
--error: #c0392b;
|
||||
--safe: #3eaf7c;
|
||||
--guarded: #e67e22;
|
||||
--dangerous: #c0392b;
|
||||
}
|
||||
|
||||
body {
|
||||
margin: 0;
|
||||
font-family: 'Inter', system-ui, -apple-system, sans-serif;
|
||||
background: var(--bg-color);
|
||||
color: var(--text-color);
|
||||
display: flex;
|
||||
height: 100vh;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
/* Sidebar */
|
||||
.sidebar {
|
||||
width: 240px;
|
||||
background: var(--sidebar-color);
|
||||
border-right: 1px solid var(--border-color);
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
flex-shrink: 0;
|
||||
}
|
||||
|
||||
.sidebar-header {
|
||||
padding: 24px;
|
||||
font-weight: 800;
|
||||
font-size: 14px;
|
||||
letter-spacing: 0.1em;
|
||||
color: var(--accent-color);
|
||||
border-bottom: 1px solid var(--border-color);
|
||||
}
|
||||
|
||||
.nav-list {
|
||||
list-style: none;
|
||||
padding: 12px 0;
|
||||
margin: 0;
|
||||
flex-grow: 1;
|
||||
}
|
||||
|
||||
.nav-item {
|
||||
padding: 12px 24px;
|
||||
cursor: pointer;
|
||||
font-size: 14px;
|
||||
color: var(--text-muted);
|
||||
transition: all 0.2s;
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 12px;
|
||||
}
|
||||
|
||||
.nav-item:hover {
|
||||
background: rgba(255, 255, 255, 0.05);
|
||||
color: var(--text-color);
|
||||
}
|
||||
|
||||
.nav-item.active {
|
||||
background: rgba(62, 175, 124, 0.1);
|
||||
color: var(--accent-color);
|
||||
border-left: 3px solid var(--accent-color);
|
||||
}
|
||||
|
||||
.sidebar-footer {
|
||||
padding: 16px;
|
||||
border-top: 1px solid var(--border-color);
|
||||
font-size: 12px;
|
||||
}
|
||||
|
||||
/* Content Area */
|
||||
.main-content {
|
||||
flex-grow: 1;
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
header {
|
||||
height: 64px;
|
||||
border-bottom: 1px solid var(--border-color);
|
||||
display: flex;
|
||||
align-items: center;
|
||||
padding: 0 24px;
|
||||
justify-content: space-between;
|
||||
background: var(--bg-color);
|
||||
}
|
||||
|
||||
.view-title {
|
||||
font-size: 18px;
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
.content-scroll {
|
||||
flex-grow: 1;
|
||||
overflow-y: auto;
|
||||
padding: 24px;
|
||||
}
|
||||
|
||||
/* Cards & Grids */
|
||||
.grid {
|
||||
display: grid;
|
||||
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
|
||||
gap: 20px;
|
||||
}
|
||||
|
||||
.card {
|
||||
background: var(--card-color);
|
||||
border: 1px solid var(--border-color);
|
||||
padding: 20px;
|
||||
border-radius: 4px;
|
||||
position: relative;
|
||||
}
|
||||
|
||||
.card-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
align-items: center;
|
||||
margin-bottom: 16px;
|
||||
}
|
||||
|
||||
.card-title {
|
||||
font-weight: 700;
|
||||
font-size: 16px;
|
||||
}
|
||||
|
||||
/* Status Badges */
|
||||
.badge {
|
||||
padding: 4px 8px;
|
||||
border-radius: 4px;
|
||||
font-size: 11px;
|
||||
font-weight: 700;
|
||||
text-transform: uppercase;
|
||||
}
|
||||
|
||||
.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
|
||||
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
|
||||
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
|
||||
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
|
||||
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
|
||||
|
||||
/* Timeline */
|
||||
.timeline {
|
||||
display: flex;
|
||||
flex-direction: column;
|
||||
gap: 12px;
|
||||
}
|
||||
|
||||
.event {
|
||||
padding: 12px;
|
||||
border-left: 2px solid var(--border-color);
|
||||
background: rgba(255, 255, 255, 0.02);
|
||||
font-family: ui-monospace, monospace;
|
||||
font-size: 13px;
|
||||
}
|
||||
|
||||
.event.high { border-left-color: var(--error); }
|
||||
.event.medium { border-left-color: var(--unstable); }
|
||||
.event.low { border-left-color: var(--nominal); }
|
||||
|
||||
.event-header {
|
||||
display: flex;
|
||||
justify-content: space-between;
|
||||
margin-bottom: 4px;
|
||||
color: var(--text-muted);
|
||||
}
|
||||
|
||||
/* Forms & Inputs */
|
||||
.controls {
|
||||
display: flex;
|
||||
gap: 12px;
|
||||
margin-top: 20px;
|
||||
}
|
||||
|
||||
input, button {
|
||||
background: var(--card-color);
|
||||
border: 1px solid var(--border-color);
|
||||
color: var(--text-color);
|
||||
padding: 8px 16px;
|
||||
font-size: 14px;
|
||||
border-radius: 4px;
|
||||
}
|
||||
|
||||
button {
|
||||
cursor: pointer;
|
||||
font-weight: 600;
|
||||
}
|
||||
|
||||
button:hover { background: var(--border-color); }
|
||||
|
||||
.btn-primary { background: var(--accent-color); color: white; border: none; }
|
||||
.btn-primary:hover { background: #359b6d; }
|
||||
|
||||
/* Utility */
|
||||
.hidden { display: none !important; }
|
||||
.mono { font-family: ui-monospace, monospace; }
|
||||
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
|
||||
.value { font-weight: 500; margin-bottom: 12px; }
|
||||
|
||||
.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
|
||||
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
|
||||
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
|
||||
|
||||
</style>
|
||||
</head>
|
||||
<body>
|
||||
<aside class="sidebar">
|
||||
<div class="sidebar-header">HOMELAB OPERATOR</div>
|
||||
<ul class="nav-list">
|
||||
<li class="nav-item active" onclick="showView('dashboard', this)">
|
||||
<span>Dashboard</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('actions', this)">
|
||||
<span>Action Queue</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('nodes', this)">
|
||||
<span>Nodes</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('services', this)">
|
||||
<span>Services</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('deployments', this)">
|
||||
<span>Deployments</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('topology', this)">
|
||||
<span>Topology</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('events', this)">
|
||||
<span>Events</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('correlation', this)">
|
||||
<span>Correlation</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('recommendations', this)">
|
||||
<span>Recommendations</span>
|
||||
</li>
|
||||
<li class="nav-item" onclick="showView('settings', this)">
|
||||
<span>Settings</span>
|
||||
</li>
|
||||
</ul>
|
||||
<div class="sidebar-footer">
|
||||
<div id="summary-status">System Status: Loading...</div>
|
||||
</div>
|
||||
</aside>
|
||||
|
||||
<main class="main-content">
|
||||
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
|
||||
RUNTIME STATE IS STALE
|
||||
</div>
|
||||
<header>
|
||||
<div style="display:flex; align-items:center; gap:20px">
|
||||
<div class="view-title" id="current-view-title">Dashboard</div>
|
||||
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
|
||||
<option value="observe">OBSERVE</option>
|
||||
<option value="recommend">RECOMMEND</option>
|
||||
<option value="approval" selected>APPROVAL</option>
|
||||
<option value="autonomous">AUTONOMOUS</option>
|
||||
<option value="maintenance">MAINTENANCE</option>
|
||||
</select>
|
||||
</div>
|
||||
<div class="header-actions" style="display:flex; gap:8px; align-items:center">
|
||||
<button onclick="refreshData()">Refresh</button>
|
||||
<button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
|
||||
</div>
|
||||
</header>
|
||||
|
||||
<div class="content-scroll">
|
||||
<!-- Dashboard View -->
|
||||
<div id="view-dashboard" class="view">
|
||||
<div class="grid">
|
||||
<div class="card">
|
||||
<div class="card-title">System Overview</div>
|
||||
<div id="dashboard-summary" style="margin-top:20px"></div>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="card-title">Pending Actions</div>
|
||||
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
|
||||
</div>
|
||||
<div class="card">
|
||||
<div class="card-title">Active Incidents</div>
|
||||
<div id="dashboard-incidents" style="margin-top:20px"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Actions View -->
|
||||
<div id="view-actions" class="view hidden">
|
||||
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
|
||||
<div>
|
||||
<h3>Pending Approval</h3>
|
||||
<div id="actions-pending" class="timeline"></div>
|
||||
</div>
|
||||
<div>
|
||||
<h3>Active / History</h3>
|
||||
<div id="actions-history" class="timeline"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Nodes View -->
|
||||
<div id="view-nodes" class="view hidden">
|
||||
<div class="grid" id="nodes-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Services View -->
|
||||
<div id="view-services" class="view hidden">
|
||||
<div class="grid" id="services-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Deployments View -->
|
||||
<div id="view-deployments" class="view hidden">
|
||||
<div class="grid" id="deployments-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Topology View -->
|
||||
<div id="view-topology" class="view hidden">
|
||||
<div class="card" style="min-height:500px">
|
||||
<div class="card-title">Runtime Topology</div>
|
||||
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- Events View -->
|
||||
<div id="view-events" class="view hidden">
|
||||
<div class="timeline" id="events-timeline"></div>
|
||||
</div>
|
||||
|
||||
<!-- Correlation View -->
|
||||
<div id="view-correlation" class="view hidden">
|
||||
<div id="correlation-chains" class="grid"></div>
|
||||
</div>
|
||||
|
||||
<!-- Recommendations View -->
|
||||
<div id="view-recommendations" class="view hidden">
|
||||
<div class="grid" id="recommendations-list"></div>
|
||||
</div>
|
||||
|
||||
<!-- Settings View -->
|
||||
<div id="view-settings" class="view hidden">
|
||||
<div class="card">
|
||||
<div class="card-title">Configuration</div>
|
||||
<div id="settings-content" style="margin-top:20px"></div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</main>
|
||||
|
||||
<script>
|
||||
let currentView = 'dashboard';
|
||||
const pollInterval = 5000;
|
||||
|
||||
function showView(viewId, el) {
|
||||
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
|
||||
document.getElementById('view-' + viewId).classList.remove('hidden');
|
||||
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
|
||||
if (el) el.classList.add('active');
|
||||
currentView = viewId;
|
||||
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
|
||||
refreshData();
|
||||
}
|
||||
|
||||
async function fetchData(endpoint) {
  try {
    const res = await fetch(endpoint, {cache: 'no-store'});
    // Treat HTTP error responses as failures so callers receive null, not an error body.
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.json();
  } catch (e) {
    console.error('Fetch error:', endpoint, e);
    return null;
  }
}
|
||||
|
||||
async function postData(endpoint, data) {
|
||||
try {
|
||||
const res = await fetch(endpoint, {
|
||||
method: 'POST',
|
||||
headers: {'Content-Type': 'application/json'},
|
||||
body: JSON.stringify(data)
|
||||
});
|
||||
return await res.json();
|
||||
} catch (e) {
|
||||
console.error('Post error:', endpoint, e);
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
async function mutateAction(id, status) {
|
||||
const res = await postData('/action/mutate', {id, status});
|
||||
if (res && res.status === 'ok') {
|
||||
refreshData();
|
||||
} else {
|
||||
alert('Mutation failed');
|
||||
}
|
||||
}
|
||||
|
||||
async function setOperatorMode(mode) {
  console.log('Requesting operator mode:', mode);
  const res = await postData('/mode', {mode});
  if (res && res.status === 'ok') {
    console.log('Mode updated successfully');
  } else {
    // Surface the failure instead of silently leaving the selector out of sync.
    console.error('Mode update failed:', mode);
  }
}
|
||||
|
||||
function formatTime(ts) {
|
||||
if (!ts) return 'N/A';
|
||||
return new Date(ts * 1000).toLocaleString();
|
||||
}
|
||||
|
||||
function getStatusClass(status) {
|
||||
status = (status || '').toLowerCase();
|
||||
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
|
||||
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
|
||||
if (['unstable'].includes(status)) return 'status-unstable';
|
||||
if (['reconciling'].includes(status)) return 'status-reconciling';
|
||||
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
|
||||
return '';
|
||||
}
|
||||
|
||||
async function refreshData() {
|
||||
// Refresh summary always
|
||||
const summary = await fetchData('/summary');
|
||||
if (summary) {
|
||||
const statusEl = document.getElementById('summary-status');
|
||||
statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
|
||||
statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
|
||||
|
||||
// Handle stale state
|
||||
const staleBanner = document.getElementById('stale-banner');
|
||||
if (summary.stale) {
|
||||
staleBanner.classList.remove('hidden');
|
||||
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
|
||||
} else {
|
||||
staleBanner.classList.add('hidden');
|
||||
}
|
||||
|
||||
if (currentView === 'dashboard') {
|
||||
const dashSummary = document.getElementById('dashboard-summary');
|
||||
dashSummary.innerHTML = `
|
||||
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
|
||||
<div class="label">Services</div><div class="value">${summary.service_count}</div>
|
||||
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
|
||||
`;
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'dashboard' || currentView === 'actions') {
|
||||
const actions = await fetchData('/actions');
|
||||
if (actions) {
|
||||
if (currentView === 'dashboard') {
|
||||
const dashActions = document.getElementById('dashboard-actions-summary');
|
||||
const pendingCount = actions.pending.length;
|
||||
dashActions.innerHTML = `
|
||||
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
|
||||
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
|
||||
`;
|
||||
}
|
||||
if (currentView === 'actions') {
|
||||
const pendingEl = document.getElementById('actions-pending');
|
||||
const historyEl = document.getElementById('actions-history');
|
||||
|
||||
pendingEl.innerHTML = actions.pending.map(a => `
|
||||
<div class="card" style="margin-bottom:12px">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
|
||||
<span class="badge risk-${a.risk_level || a.risk || 'unknown'}">${a.risk_level || a.risk || 'unknown'}</span>
|
||||
</div>
|
||||
<p>${a.description || a.action_type || 'No description'}</p>
|
||||
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
|
||||
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
|
||||
<div class="controls">
|
||||
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
|
||||
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
|
||||
</div>
|
||||
</div>
|
||||
`).join('') || 'No pending actions.';
|
||||
|
||||
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
|
||||
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
|
||||
<div class="event">
|
||||
<div class="event-header">
|
||||
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
|
||||
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
|
||||
</div>
|
||||
<div>${a.description || a.action_type || 'No description'}</div>
|
||||
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node) || 'unknown'}</small>
|
||||
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
|
||||
${a.transition_history ? `
|
||||
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
|
||||
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
|
||||
</div>
|
||||
` : ''}
|
||||
</div>
|
||||
`).join('') || 'No history.';
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'dashboard' || currentView === 'events') {
|
||||
const incidents = await fetchData('/incidents');
|
||||
if (currentView === 'dashboard') {
|
||||
const dashIncidents = document.getElementById('dashboard-incidents');
|
||||
if (!incidents || incidents.length === 0) {
|
||||
dashIncidents.textContent = 'No active incidents.';
|
||||
} else {
|
||||
dashIncidents.innerHTML = incidents.map(inc => `
|
||||
<div class="event ${inc.severity}">
|
||||
<strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
|
||||
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'nodes') {
|
||||
const nodes = await fetchData('/nodes');
|
||||
const list = document.getElementById('nodes-list');
|
||||
list.innerHTML = (nodes || []).map(node => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${node.hostname}</div>
|
||||
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||
</div>
|
||||
<div class="label">ID</div><div class="value mono">${node.id}</div>
|
||||
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
|
||||
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
|
||||
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
|
||||
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
|
||||
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'services') {
|
||||
const services = await fetchData('/services');
|
||||
const list = document.getElementById('services-list');
|
||||
list.innerHTML = (services || []).map(svc => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${svc.name}</div>
|
||||
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
|
||||
</div>
|
||||
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
|
||||
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
|
||||
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
|
||||
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'deployments') {
|
||||
const deps = await fetchData('/deployments');
|
||||
const list = document.getElementById('deployments-list');
|
||||
list.innerHTML = (deps || []).map(dep => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${dep.service}</div>
|
||||
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
|
||||
</div>
|
||||
<div class="label">ID</div><div class="value mono">${dep.id}</div>
|
||||
<div class="label">Stage</div><div class="value">${dep.stage}</div>
|
||||
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
|
||||
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
|
||||
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'events') {
|
||||
const events = await fetchData('/events');
|
||||
const timeline = document.getElementById('events-timeline');
|
||||
timeline.innerHTML = (events || []).map(ev => `
|
||||
<div class="event ${ev.severity}">
|
||||
<div class="event-header">
|
||||
<span>${ev.type.toUpperCase()}</span>
|
||||
<span>${formatTime(ev.timestamp)}</span>
|
||||
</div>
|
||||
<div>${ev.message}</div>
|
||||
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'recommendations') {
|
||||
const recs = await fetchData('/recommendations');
|
||||
const list = document.getElementById('recommendations-list');
|
||||
list.innerHTML = (recs || []).map(rec => `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${rec.title}</div>
|
||||
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
|
||||
</div>
|
||||
<p>${rec.description}</p>
|
||||
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
|
||||
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
|
||||
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
|
||||
<div class="controls">
|
||||
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
|
||||
</div>
|
||||
</div>
|
||||
`).join('');
|
||||
}
|
||||
|
||||
if (currentView === 'topology') {
|
||||
const nodes = await fetchData('/nodes');
|
||||
const services = await fetchData('/services');
|
||||
const topMap = document.getElementById('topology-map');
|
||||
if (nodes && services) {
|
||||
topMap.innerHTML = nodes.map(node => {
|
||||
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
|
||||
return `
|
||||
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
|
||||
<div class="card-header">
|
||||
<div class="card-title">${node.hostname}</div>
|
||||
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||
</div>
|
||||
<div class="label">Capabilities</div>
|
||||
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
|
||||
<div class="label">Services</div>
|
||||
<div style="font-size:12px; margin-bottom:10px">
|
||||
${nodeServices.length > 0 ? nodeServices.map(s => `
|
||||
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
|
||||
<span>${s.name}</span>
|
||||
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
|
||||
</div>
|
||||
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
|
||||
`).join('') : '<div class="value">None</div>'}
|
||||
</div>
|
||||
</div>
|
||||
`;
|
||||
}).join('');
|
||||
}
|
||||
}
|
||||
|
||||
if (currentView === 'correlation') {
|
||||
const incidents = await fetchData('/incidents');
|
||||
const actions = await fetchData('/actions');
|
||||
const list = document.getElementById('correlation-chains');
|
||||
if (incidents && actions) {
|
||||
const allActions = Object.values(actions).flat();
|
||||
list.innerHTML = incidents.map(inc => {
|
||||
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
|
||||
return `
|
||||
<div class="card">
|
||||
<div class="card-header">
|
||||
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
|
||||
<span class="badge status-error">Active</span>
|
||||
</div>
|
||||
<p>${inc.message}</p>
|
||||
<div class="label">Related Actions</div>
|
||||
${related.map(a => `
|
||||
<div class="event" style="margin-top:5px">
|
||||
<strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
|
||||
<small>${a.description}</small>
|
||||
</div>
|
||||
`).join('') || '<div class="value">No actions yet</div>'}
|
||||
</div>
|
||||
`;
|
||||
}).join('');
|
||||
}
|
||||
}
|
||||
if (currentView === 'settings') {
|
||||
const config = (await fetchData('/config')) || {};
|
||||
const content = document.getElementById('settings-content');
|
||||
content.innerHTML = `
|
||||
<div class="label">Auto Mode</div>
|
||||
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
|
||||
<div class="label">Action Thresholds</div>
|
||||
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
|
||||
<div class="label">Telegram Integration</div>
|
||||
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
|
||||
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
|
||||
`;
|
||||
}
|
||||
}
|
||||
|
||||
async function copyForAI() {
|
||||
const btn = document.getElementById('copy-ai-btn');
|
||||
const original = btn.textContent;
|
||||
btn.textContent = 'Copying...';
|
||||
btn.disabled = true;
|
||||
|
||||
try {
|
||||
const snap = await fetchData('/snapshot');
|
||||
if (!snap) throw new Error('snapshot fetch failed');
|
||||
|
||||
const now = new Date(snap.timestamp);
|
||||
const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
|
||||
const lines = [];
|
||||
|
||||
lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
|
||||
|
||||
if (snap.nodes && snap.nodes.length > 0) {
|
||||
lines.push('NODES: ' + snap.nodes.map(n =>
|
||||
`${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
|
||||
).join(', '));
|
||||
} else {
|
||||
lines.push('NODES: none');
|
||||
}
|
||||
|
||||
if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
|
||||
lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
|
||||
`${s.name} (${s.node}) - ${s.health}`
|
||||
).join(', '));
|
||||
} else {
|
||||
lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
|
||||
}
|
||||
|
||||
const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
|
||||
if (activeIncidents.length > 0) {
|
||||
lines.push('INCIDENTS: ' + activeIncidents.map(i =>
|
||||
`[${i.severity}] ${i.message} (${i.node})`
|
||||
).join('; '));
|
||||
} else {
|
||||
lines.push('INCIDENTS: none');
|
||||
}
|
||||
|
||||
if (snap.events && snap.events.length > 0) {
|
||||
lines.push(`EVENTS (last ${snap.events.length}):`);
|
||||
snap.events.forEach(ev => {
|
||||
const ts = ev.timestamp
|
||||
? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
|
||||
: '?';
|
||||
const svc = ev.service ? '/' + ev.service : '';
|
||||
lines.push(` ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
|
||||
});
|
||||
} else {
|
||||
lines.push('EVENTS (last 10): none');
|
||||
}
|
||||
|
||||
const s = snap.summary || {};
|
||||
lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
|
||||
|
||||
await navigator.clipboard.writeText(lines.join('\n'));
|
||||
btn.textContent = 'Copied!';
|
||||
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
|
||||
} catch (e) {
|
||||
console.error('copyForAI error:', e);
|
||||
btn.textContent = 'Error';
|
||||
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
|
||||
}
|
||||
}
|
||||
|
||||
// Initial load
|
||||
refreshData();
|
||||
// Poll for updates
|
||||
setInterval(refreshData, pollInterval);
|
||||
|
||||
</script>
|
||||
</body>
|
||||
</html>
|
||||
301
services/agent-system/webui/web.py
Normal file
|
|
@ -0,0 +1,301 @@
|
|||
import json
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
|
||||
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
|
||||
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
|
||||
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
|
||||
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
|
||||
|
||||
STATIC_DIR = Path(__file__).parent
|
||||
|
||||
DEFAULT_CONFIG = {
|
||||
"operator_mode": "approval",
|
||||
"auto_mode": True,
|
||||
"action_thresholds": {
|
||||
"restart_ha": 0.8,
|
||||
"check_network": 0.9,
|
||||
},
|
||||
"default_threshold": 0.9,
|
||||
"allowed_auto_actions": ["restart_ha"],
|
||||
}
|
||||
|
||||
|
||||
def read_json_file(path, default=None):
|
||||
if not path.exists():
|
||||
return default if default is not None else []
|
||||
try:
|
||||
return json.loads(path.read_text())
|
||||
except Exception:
|
||||
return default if default is not None else []
|
||||
|
||||
|
||||
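def get_config():
    # Return a fresh copy so callers (e.g. POST /config, which calls
    # config.update(payload)) never mutate the module-level DEFAULT_CONFIG.
    defaults = json.loads(json.dumps(DEFAULT_CONFIG))
    config_path = STATE_DIR / "operator-config.json"
    if config_path.exists():
        return read_json_file(config_path, defaults)
    return defaults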
|
||||
|
||||
|
||||
def save_config(config):
|
||||
STATE_DIR.mkdir(parents=True, exist_ok=True)
|
||||
(STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
|
||||
|
||||
|
||||
def current_nodes():
|
||||
return read_json_file(WORLD_DIR / "nodes.json")
|
||||
|
||||
|
||||
def current_services():
|
||||
return read_json_file(WORLD_DIR / "services.json")
|
||||
|
||||
|
||||
def current_deployments():
|
||||
return read_json_file(WORLD_DIR / "deployments.json")
|
||||
|
||||
|
||||
def current_incidents():
|
||||
return read_json_file(WORLD_DIR / "incidents.json")
|
||||
|
||||
|
||||
def current_recommendations():
|
||||
return read_json_file(WORLD_DIR / "recommendations.json")
|
||||
|
||||
|
||||
def current_summary():
|
||||
path = WORLD_DIR / "runtime-summary.json"
|
||||
summary = read_json_file(path, default={})
|
||||
if summary:
|
||||
last_update_val = summary.get("last_update")
|
||||
if last_update_val:
|
||||
try:
|
||||
if isinstance(last_update_val, str):
|
||||
last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
|
||||
else:
|
||||
last_update = float(last_update_val)
|
||||
except Exception:
|
||||
last_update = os.path.getmtime(path)
|
||||
else:
|
||||
last_update = os.path.getmtime(path)
|
||||
summary["last_update"] = last_update
|
||||
summary["stale"] = (time.time() - last_update) > 60
|
||||
return summary
|
||||
|
||||
|
||||
def current_events():
|
||||
return read_json_file(WORLD_DIR / "events.json", default=[])
|
||||
|
||||
|
||||
def current_actions():
|
||||
actions = {}
|
||||
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
|
||||
for status in statuses:
|
||||
actions[status] = []
|
||||
status_dir = ACTIONS_DIR / status
|
||||
if status_dir.exists():
|
||||
for f in status_dir.glob("*.json"):
|
||||
data = read_json_file(f)
|
||||
if data:
|
||||
# Injects some metadata for UI
|
||||
data["id"] = data.get("action_id") or f.stem
|
||||
data["status"] = status
|
||||
actions[status].append(data)
|
||||
return actions
|
||||
|
||||
|
||||
def mutate_action(action_id, target_status):
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"

    # action_id comes from the request body; reject anything that is not a
    # plain file name so it cannot point outside ACTIONS_DIR (e.g. "../x").
    if not action_id or not isinstance(action_id, str) or Path(action_id).name != action_id:
        return False, f"Invalid action id: {action_id!r}"

    # Find where the action is
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break

    if not source_path:
        return False, f"Action {action_id} not found"

    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"

    try:
        data = json.loads(source_path.read_text())
        now = time.time()
        data["status"] = target_status
        data["updated_at"] = now

        # Keep history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": now,
        })
        data["transition_history"] = history

        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)
|
||||
|
||||
|
||||
def get_snapshot():
|
||||
nodes = current_nodes()
|
||||
services = current_services()
|
||||
incidents = current_incidents()
|
||||
events = current_events()
|
||||
summary = current_summary()
|
||||
|
||||
non_nominal = [s for s in services if s.get("health") != "nominal"]
|
||||
nominal_count = len(services) - len(non_nominal)
|
||||
|
||||
return {
|
||||
"timestamp": datetime.now(timezone.utc).isoformat(),
|
||||
"summary": summary,
|
||||
"nodes": nodes,
|
||||
"non_nominal_services": non_nominal,
|
||||
"nominal_service_count": nominal_count,
|
||||
"total_service_count": len(services),
|
||||
"incidents": incidents,
|
||||
"events": events[:10],
|
||||
}
|
||||
|
||||
|
||||
def send_json(status, payload, handler):
|
||||
body = (json.dumps(payload) + "\n").encode("utf-8")
|
||||
handler.send_response(status)
|
||||
handler.send_header("Content-Type", "application/json")
|
||||
handler.send_header("Content-Length", str(len(body)))
|
||||
handler.end_headers()
|
||||
handler.wfile.write(body)
|
||||
|
||||
|
||||
class Handler(BaseHTTPRequestHandler):
|
||||
def do_GET(self):
|
||||
if self.path == "/config":
|
||||
send_json(200, get_config(), self)
|
||||
return
|
||||
|
||||
if self.path == "/nodes":
|
||||
send_json(200, current_nodes(), self)
|
||||
return
|
||||
|
||||
if self.path == "/services":
|
||||
send_json(200, current_services(), self)
|
||||
return
|
||||
|
||||
if self.path == "/deployments":
|
||||
send_json(200, current_deployments(), self)
|
||||
return
|
||||
|
||||
if self.path == "/incidents":
|
||||
send_json(200, current_incidents(), self)
|
||||
return
|
||||
|
||||
if self.path == "/recommendations":
|
||||
send_json(200, current_recommendations(), self)
|
||||
return
|
||||
|
||||
if self.path == "/summary":
|
||||
send_json(200, current_summary(), self)
|
||||
return
|
||||
|
||||
if self.path == "/events":
|
||||
send_json(200, current_events(), self)
|
||||
return
|
||||
|
||||
if self.path == "/actions":
|
||||
send_json(200, current_actions(), self)
|
||||
return
|
||||
|
||||
if self.path == "/snapshot":
|
||||
send_json(200, get_snapshot(), self)
|
||||
return
|
||||
|
||||
if self.path in ("/", "/index.html"):
|
||||
body = (STATIC_DIR / "index.html").read_bytes()
|
||||
self.send_response(200)
|
||||
self.send_header("Content-Type", "text/html; charset=utf-8")
|
||||
self.send_header("Content-Length", str(len(body)))
|
||||
self.end_headers()
|
||||
self.wfile.write(body)
|
||||
return
|
||||
|
||||
self.send_error(404)
|
||||
|
||||
def do_POST(self):
|
||||
if self.path not in (
|
||||
"/config",
|
||||
"/action/mutate",
|
||||
"/mode",
|
||||
):
|
||||
self.send_error(404)
|
||||
return
|
||||
|
||||
length = int(self.headers.get("Content-Length", "0"))
|
||||
raw_body = self.rfile.read(length).decode("utf-8")
|
||||
try:
|
||||
payload = json.loads(raw_body)
|
||||
except json.JSONDecodeError:
|
||||
self.send_error(400, "Invalid JSON")
|
||||
return
|
||||
|
||||
if self.path == "/config":
|
||||
config = get_config()
|
||||
config.update(payload)
|
||||
save_config(config)
|
||||
send_json(200, {"status": "ok"}, self)
|
||||
return
|
||||
|
||||
if self.path == "/mode":
|
||||
mode = payload.get("mode")
|
||||
if not mode:
|
||||
self.send_error(400, "mode is required")
|
||||
return
|
||||
config = get_config()
|
||||
config["operator_mode"] = mode
|
||||
save_config(config)
|
||||
send_json(200, {"status": "ok"}, self)
|
||||
return
|
||||
|
||||
if self.path == "/action/mutate":
|
||||
action_id = payload.get("id")
|
||||
target = payload.get("status")
|
||||
if not action_id or not target:
|
||||
self.send_error(400, "id and status are required")
|
||||
return
|
||||
success, msg = mutate_action(action_id, target)
|
||||
if success:
|
||||
send_json(200, {"status": "ok"}, self)
|
||||
else:
|
||||
self.send_error(500, msg)
|
||||
return
|
||||
|
||||
def log_message(self, format, *args):
|
||||
return
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Ensure directories exist
|
||||
for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
|
||||
d.mkdir(parents=True, exist_ok=True)
|
||||
for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
|
||||
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
|
||||
|
||||
port = int(os.getenv("PORT", "8080"))
|
||||
print(f"Operator Control Plane starting on 0.0.0.0:{port}")
|
||||
server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
|
||||
server.serve_forever()
|
||||
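Reviewer's sketch, not part of this change set: `mutate_action` in `web.py` builds a file path from a request-supplied id and writes the target with `write_text`, which can leave a half-written file if the process dies mid-write. The helpers below show one way to tighten both points, using the same write-then-rename idea as `_atomic_write_json` in `executor.py`. The names `SAFE_ID`, `is_safe_action_id`, and `atomic_write_json`, and the allow-list pattern itself, are assumptions made for illustration.

```python
import json
import os
import re
import tempfile
from pathlib import Path

# Hypothetical allow-list: letters, digits, dot, underscore, hyphen.
# Path separators are excluded, so an id can never name a parent directory.
SAFE_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def is_safe_action_id(action_id) -> bool:
    """Return True only for ids that are safe to use as a file stem."""
    return (
        isinstance(action_id, str)
        and SAFE_ID.fullmatch(action_id) is not None
        and ".." not in action_id
    )


def atomic_write_json(path: Path, data) -> None:
    """Write JSON to a temp file in the same directory, fsync, then rename."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except Exception:
        # Leave no stray temp file behind if anything failed before the rename.
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

Because the temp file ends in `.tmp`, the `glob("*.json")` scan in `current_actions` would not pick it up while a write is in progress.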
10
services/brain-watchdog/Dockerfile
Normal file
@ -0,0 +1,10 @@
FROM python:3.11-slim

WORKDIR /app

COPY src/ src/

ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app/src

CMD ["python", "-m", "brain_watchdog.main"]
30
services/brain-watchdog/docker-compose.yml
Normal file
@ -0,0 +1,30 @@
services:
  brain-watchdog:
    build: .
    container_name: brain-watchdog
    restart: unless-stopped

    env_file:
      - /opt/homelab/config/brain-watchdog/.env

    volumes:
      - brain_watchdog_data:/data

    healthcheck:
      test:
        - "CMD"
        - "python"
        - "-c"
        - |
          import os, time, json, sys
          p = '/data/state.json'
          if not os.path.exists(p): sys.exit(1)
          age = time.time() - os.path.getmtime(p)
          sys.exit(0 if age < 300 else 1)
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  brain_watchdog_data:
7
services/brain-watchdog/env.example
Normal file
@ -0,0 +1,7 @@
CONTROL_PLANE_URL=
STALE_THRESHOLD=600
INTERVAL=60
FAILS_BEFORE_ALERT=3
TG_TOKEN=
TG_CHAT_ID=
HEALTHCHECKS_URL=
10
services/brain-watchdog/healthcheck.sh
Executable file
@ -0,0 +1,10 @@
#!/bin/sh
# Healthy if state.json was written within the last 5 minutes.
python -c "
import os, time, sys
p = '/data/state.json'
if not os.path.exists(p):
    sys.exit(1)
age = time.time() - os.path.getmtime(p)
sys.exit(0 if age < 300 else 1)
"
3
services/brain-watchdog/pytest.ini
Normal file
@ -0,0 +1,3 @@
[pytest]
pythonpath = src
testpaths = tests
34
services/brain-watchdog/service.yaml
Normal file
|
|
@ -0,0 +1,34 @@
|
|||
service:
|
||||
name: brain-watchdog
|
||||
owner_node: piha
|
||||
exposure: private
|
||||
description: >
|
||||
External watchdog for the control-plane on VPS. Queries /summary over
|
||||
Tailscale and alerts via Telegram Bot API directly — no dependency on the
|
||||
control-plane itself. Freshness is computed locally from last_update epoch.
|
||||
|
||||
dependencies:
|
||||
- control-plane # external — on VPS; deliberately untrusted for liveness
|
||||
|
||||
healthcheck:
|
||||
type: docker
|
||||
interval: 60s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
start_period: 30s
|
||||
|
||||
restart_policy: unless-stopped
|
||||
|
||||
persistence:
|
||||
paths:
|
||||
- /data # state.json: fail_count, alerted, last_ok
|
||||
|
||||
runtime:
|
||||
env_vars:
|
||||
- CONTROL_PLANE_URL # Tailscale IP + port of operator-ui (required)
|
||||
- STALE_THRESHOLD # seconds before brain is considered stale (default: 600)
|
||||
- INTERVAL # poll interval seconds (default: 60)
|
||||
- FAILS_BEFORE_ALERT # consecutive failures before Telegram alert (default: 3)
|
||||
- TG_TOKEN # Telegram Bot API token (required)
|
||||
- TG_CHAT_ID # Telegram chat/user ID (required)
|
||||
- HEALTHCHECKS_URL # optional healthchecks.io ping URL
|
||||
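Reviewer's sketch of the freshness rule described in `service.yaml` ("computed locally from last_update epoch"). `web.py` accepts either an epoch number or an ISO-8601 string when it reads `runtime-summary.json`; the helper below mirrors that tolerance on the consumer side. The function names `to_epoch` and `is_stale` are hypothetical, and treating a timezone-less ISO string as UTC is an assumption.

```python
import time
from datetime import datetime, timezone
from typing import Optional


def to_epoch(value) -> float:
    """Convert an epoch number or an ISO-8601 string to epoch seconds."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)  # numeric string such as "1704067200.5"
        except ValueError:
            pass
        # Older Python versions do not parse a trailing "Z", so rewrite it.
        dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return dt.timestamp()
    raise TypeError(f"unsupported last_update value: {value!r}")


def is_stale(last_update, threshold_s: float, now: Optional[float] = None) -> bool:
    """True when last_update is older than threshold_s seconds."""
    now = time.time() if now is None else now
    return (now - to_epoch(last_update)) > threshold_s
```

Passing `now` explicitly keeps the check deterministic in tests; in production it defaults to the local clock, so the verdict never depends on a self-reported status field.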
157
services/brain-watchdog/src/brain_watchdog/main.py
Normal file
|
|
@ -0,0 +1,157 @@
|
|||
"""
|
||||
brain-watchdog: external watchdog for the control-plane on VPS.
|
||||
|
||||
Runs on PIHA; queries /summary directly over Tailscale and alerts via
|
||||
Telegram Bot API without going through the control-plane itself.
|
||||
Never trusts the self-reported "status" field — freshness is computed
|
||||
locally from last_update epoch vs. time.time().
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
import urllib.error
|
||||
import urllib.request
|
||||
from pathlib import Path
|
||||
|
||||
CONTROL_PLANE_URL = os.environ["CONTROL_PLANE_URL"].rstrip("/")
|
||||
STALE_THRESHOLD = int(os.environ.get("STALE_THRESHOLD", "600"))
|
||||
INTERVAL = int(os.environ.get("INTERVAL", "60"))
|
||||
FAILS_BEFORE_ALERT = int(os.environ.get("FAILS_BEFORE_ALERT", "3"))
|
||||
TG_TOKEN = os.environ["TG_TOKEN"]
|
||||
TG_CHAT_ID = os.environ["TG_CHAT_ID"]
|
||||
HEALTHCHECKS_URL = os.environ.get("HEALTHCHECKS_URL", "").strip()
|
||||
|
||||
STATE_FILE = Path("/data/state.json")
|
||||
|
||||
|
||||
def load_state() -> dict:
|
||||
if STATE_FILE.exists():
|
||||
try:
|
||||
return json.loads(STATE_FILE.read_text())
|
||||
except Exception:
|
||||
pass
|
||||
return {"fail_count": 0, "alerted": False, "last_ok": 0.0}
|
||||
|
||||
|
||||
def save_state(state: dict) -> None:
|
||||
STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
|
||||
STATE_FILE.write_text(json.dumps(state))
|
||||
|
||||
|
||||
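def http_get(url: str, timeout: int = 10) -> tuple[int | None, dict | None]:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        return exc.code, None
    except Exception:
        return None, None
    try:
        body = json.loads(raw)
    except ValueError:
        # Reachable but not valid JSON: keep the status so check() reports
        # "empty / invalid JSON" instead of "unreachable".
        return status, None
    # check() calls body.get(), so anything other than an object counts as invalid.
    return status, body if isinstance(body, dict) else None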
|
||||
|
||||
|
||||
def send_telegram(message: str) -> bool:
|
||||
url = f"https://api.telegram.org/bot{TG_TOKEN}/sendMessage"
|
||||
payload = json.dumps(
|
||||
{"chat_id": TG_CHAT_ID, "text": message, "parse_mode": "HTML"}
|
||||
).encode()
|
||||
req = urllib.request.Request(
|
||||
url, data=payload, headers={"Content-Type": "application/json"}
|
||||
)
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=10) as resp:
|
||||
return resp.status == 200
|
||||
except Exception as exc:
|
||||
print(f"[telegram] send failed: {exc}", flush=True)
|
||||
return False
|
||||
|
||||
|
||||
def ping_healthchecks() -> None:
|
||||
if not HEALTHCHECKS_URL:
|
||||
return
|
||||
try:
|
||||
urllib.request.urlopen(HEALTHCHECKS_URL, timeout=10)
|
||||
except Exception as exc:
|
||||
print(f"[healthchecks] ping failed: {exc}", flush=True)
|
||||
|
||||
|
||||
def check() -> tuple[bool, str]:
|
||||
"""Return (ok, human-readable reason). Never reads 'status' field."""
|
||||
status, body = http_get(f"{CONTROL_PLANE_URL}/summary")
|
||||
|
||||
if status is None:
|
||||
return False, "panel unreachable (connection error)"
|
||||
|
||||
if status != 200:
|
||||
return False, f"panel returned HTTP {status}"
|
||||
|
||||
if not body:
|
||||
return False, "panel returned empty / invalid JSON"
|
||||
|
||||
raw = body.get("last_update")
|
||||
if raw is None:
|
||||
return False, "summary missing last_update field"
|
||||
|
||||
try:
|
||||
last_update_ts = float(raw)
|
||||
except (TypeError, ValueError):
|
||||
return False, f"last_update not parseable: {raw!r}"
|
||||
|
||||
age = time.time() - last_update_ts
|
||||
if age > STALE_THRESHOLD:
|
||||
return False, (
|
||||
f"brain stale: last update {int(age // 60)}m ago "
|
||||
f"(threshold {STALE_THRESHOLD // 60}m)"
|
||||
)
|
||||
|
||||
return True, f"ok (age {int(age)}s)"
|
||||
|
||||
|
||||
def main() -> None:
|
||||
print(
|
||||
f"[brain-watchdog] starting — "
|
||||
f"url={CONTROL_PLANE_URL} "
|
||||
f"stale_threshold={STALE_THRESHOLD}s "
|
||||
f"interval={INTERVAL}s "
|
||||
f"fails_before_alert={FAILS_BEFORE_ALERT}",
|
||||
flush=True,
|
||||
)
|
||||
state = load_state()
|
||||
|
||||
while True:
|
||||
ok, reason = check()
|
||||
ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
|
||||
print(f"[{ts}] {'OK ' if ok else 'FAIL'} — {reason}", flush=True)
|
||||
|
||||
if ok:
|
||||
if state["alerted"]:
|
||||
send_telegram(
|
||||
"✅ <b>brain-watchdog: control-plane RECOVERED</b>\n"
|
||||
f"{reason}"
|
||||
)
|
||||
print("[telegram] sent recovery alert", flush=True)
|
||||
state["fail_count"] = 0
|
||||
state["alerted"] = False
|
||||
state["last_ok"] = time.time()
|
||||
save_state(state)
|
||||
ping_healthchecks()
|
||||
else:
|
||||
state["fail_count"] = state.get("fail_count", 0) + 1
|
||||
save_state(state)
|
||||
|
||||
if state["fail_count"] >= FAILS_BEFORE_ALERT and not state["alerted"]:
|
||||
sent = send_telegram(
|
||||
"🚨 <b>brain-watchdog: control-plane DOWN</b>\n"
|
||||
f"Reason: {reason}\n"
|
||||
f"Consecutive failures: {state['fail_count']}\n"
|
||||
f"URL: <code>{CONTROL_PLANE_URL}</code>"
|
||||
)
|
||||
if sent:
|
||||
state["alerted"] = True
|
||||
save_state(state)
|
||||
print("[telegram] sent alert", flush=True)
|
||||
|
||||
time.sleep(INTERVAL)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
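Reviewer's sketch of the alerting logic in `main()` above, pulled out as a pure function so the transitions can be tested without sleeping or sending Telegram messages. `WatchState` and `step` are hypothetical names; the rules are the ones in the loop: alert once after `FAILS_BEFORE_ALERT` consecutive failures, and send one recovery message after an alerted outage ends.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatchState:
    fail_count: int = 0
    alerted: bool = False


def step(state: WatchState, ok: bool, fails_before_alert: int = 3) -> Optional[str]:
    """Advance the state by one check result.

    Returns "down" or "recovered" when a notification is due, else None.
    The caller sets state.alerted = True only after a "down" message was
    actually delivered, so a failed send is retried on the next cycle.
    """
    if ok:
        event = "recovered" if state.alerted else None
        state.fail_count = 0
        state.alerted = False
        return event
    state.fail_count += 1
    if state.fail_count >= fails_before_alert and not state.alerted:
        return "down"
    return None


# Usage: three failures trigger one alert, a fourth stays quiet, then recovery.
state = WatchState()
events = []
for ok in [False, False, False, False, True]:
    event = step(state, ok)
    if event == "down":
        state.alerted = True  # pretend the send succeeded
    events.append(event)
```

Keeping the delivery result outside `step` preserves the behaviour of the original loop, where `alerted` is only set when `send_telegram` returns `True`.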
0
services/brain-watchdog/tests/__init__.py
Normal file
66
services/brain-watchdog/tests/test_main.py
Normal file
|
|
@ -0,0 +1,66 @@
|
|||
"""
|
||||
Tests for brain_watchdog.main.
|
||||
|
||||
Module-level env vars are required at import time; set them before the first
|
||||
import of the module so tests can run without a real control-plane.
|
||||
"""
|
||||
import importlib.util
|
||||
import os
|
||||
import time
|
||||
from unittest.mock import patch
|
||||
|
||||
os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
|
||||
os.environ.setdefault("TG_TOKEN", "test_token")
|
||||
os.environ.setdefault("TG_CHAT_ID", "12345")
|
||||
|
||||
import brain_watchdog.main as bwm
|
||||
|
||||
|
||||
def test_package_importable():
|
||||
spec = importlib.util.find_spec("brain_watchdog")
|
||||
assert spec is not None
|
||||
|
||||
|
||||
def test_check_ok_fresh():
|
||||
now = time.time()
|
||||
with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
|
||||
ok, reason = bwm.check()
|
||||
assert ok
|
||||
assert "ok" in reason
|
||||
|
||||
|
||||
def test_check_fail_stale():
|
||||
now = time.time()
|
||||
stale_ts = now - (bwm.STALE_THRESHOLD + 120)
|
||||
with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
|
||||
ok, reason = bwm.check()
|
||||
assert not ok
|
||||
assert "stale" in reason
|
||||
|
||||
|
||||
def test_check_fail_unreachable():
|
||||
with patch.object(bwm, "http_get", return_value=(None, None)):
|
||||
ok, reason = bwm.check()
|
||||
assert not ok
|
||||
assert "unreachable" in reason
|
||||
|
||||
|
||||
def test_check_fail_http_error():
|
||||
with patch.object(bwm, "http_get", return_value=(503, None)):
|
||||
ok, reason = bwm.check()
|
||||
assert not ok
|
||||
assert "503" in reason
|
||||
|
||||
|
||||
def test_check_fail_missing_last_update():
|
||||
with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
|
||||
ok, reason = bwm.check()
|
||||
assert not ok
|
||||
assert "last_update" in reason
|
||||
|
||||
|
||||
def test_check_fail_unparseable_timestamp():
|
||||
with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
|
||||
ok, reason = bwm.check()
|
||||
assert not ok
|
||||
assert "parseable" in reason
|
||||
24
services/control-plane/Dockerfile
Normal file
@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir pyyaml

# Create homelab user
RUN useradd -m -u 1000 homelab

# Copy sources
COPY src/ /app/src/
# scripts/ (including the observer script) is not copied into the image.
# The observer, supervisor, and executor read it from the repository,
# which docker-compose mounts at /repo, so it is found under /repo/scripts
# at runtime.

# The repo is assumed to be mounted at /repo
ENV REPO_ROOT=/repo
ENV RUNTIME_PATH=/opt/homelab
ENV PYTHONUNBUFFERED=1

# Default command (overridden in docker-compose)
USER homelab
CMD ["python", "src/operator_ui.py"]
73
services/control-plane/deploy-local.sh
Executable file
|
|
@ -0,0 +1,73 @@
|
|||
#!/bin/bash
|
||||
# services/control-plane/deploy-local.sh
|
||||
set -e
|
||||
|
||||
# 1. Validate it is deploying control-plane
|
||||
if [[ ! $(pwd) == *"/services/control-plane" ]]; then
|
||||
echo "Error: Script must be run from services/control-plane directory"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ ! -f "docker-compose.yml" ]]; then
|
||||
echo "Error: docker-compose.yml not found"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "--- Preparing Control Plane Directories ---"
|
||||
# 2. Prepare required dirs
|
||||
# /opt/homelab/config
|
||||
# /opt/homelab/actions/{pending,approved,rejected,running,completed,failed}
|
||||
# /opt/homelab/world
|
||||
# /opt/homelab/state
|
||||
|
||||
DIRS=(
|
||||
"/opt/homelab/config"
|
||||
"/opt/homelab/actions/pending"
|
||||
"/opt/homelab/actions/approved"
|
||||
"/opt/homelab/actions/rejected"
|
||||
"/opt/homelab/actions/running"
|
||||
"/opt/homelab/actions/completed"
|
||||
"/opt/homelab/actions/failed"
|
||||
"/opt/homelab/world"
|
||||
"/opt/homelab/state"
|
||||
)
|
||||
|
||||
for dir in "${DIRS[@]}"; do
|
||||
if [ ! -d "$dir" ]; then
|
||||
echo "Creating $dir"
|
||||
sudo mkdir -p "$dir"
|
||||
fi
|
||||
done
|
||||
|
||||
# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
|
||||
echo "Checking /opt/homelab ownership..."
|
||||
_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
|
||||
if [[ -n "$_chown_needed" ]]; then
|
||||
echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
|
||||
sudo chown -R 1000:1000 /opt/homelab
|
||||
else
|
||||
echo "Ownership already correct, skipping chown"
|
||||
fi
|
||||
|
||||
echo "Checking /opt/homelab directory permissions..."
|
||||
_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
|
||||
if [[ -n "$_chmod_needed" ]]; then
|
||||
echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
|
||||
sudo chmod -R 775 /opt/homelab 2>/dev/null || true
|
||||
else
|
||||
echo "Permissions already correct, skipping chmod"
|
||||
fi
|
||||
|
||||
# 4. Run docker compose up -d --build --force-recreate
|
||||
echo "--- Starting Control Plane Services ---"
|
||||
COMPOSE_ARGS="-f docker-compose.yml"
|
||||
OVERRIDE_FILE="../../hosts/vps/runtime/control-plane/docker-compose.override.yml"
|
||||
if [ -f "$OVERRIDE_FILE" ]; then
|
||||
echo "Using override: $OVERRIDE_FILE"
|
||||
COMPOSE_ARGS="$COMPOSE_ARGS -f $OVERRIDE_FILE"
|
||||
fi
|
||||
docker compose $COMPOSE_ARGS up -d --build --force-recreate
|
||||
|
||||
# 5. Print docker ps for control-plane containers
|
||||
echo "--- Deployment Status ---"
|
||||
docker ps --filter "name=control-plane"
|
||||
76
services/control-plane/docker-compose.yml
Normal file
|
|
@ -0,0 +1,76 @@
|
|||
services:
|
||||
operator-ui:
|
||||
build: .
|
||||
container_name: control-plane-ui
|
||||
user: "1000:1000"
|
||||
command: python src/operator_ui.py
|
||||
ports:
|
||||
- "18180:8080"
|
||||
volumes:
|
||||
- /opt/homelab:/opt/homelab
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/', timeout=3).read()"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
|
||||
observer:
|
||||
build: .
|
||||
container_name: control-plane-observer
|
||||
user: "1000:1000"
|
||||
command: python /repo/scripts/observer/observer.py
|
||||
volumes:
|
||||
- /opt/homelab:/opt/homelab
|
||||
- ../..:/repo:ro
|
||||
restart: unless-stopped
|
||||
environment:
|
||||
- REPO_ROOT=/repo
|
||||
- RUNTIME_PATH=/opt/homelab
|
||||
healthcheck:
|
||||
test: ["CMD", "test", "-f", "/opt/homelab/state/observer.heartbeat"]
|
||||
interval: 30s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
start_period: 5s
|
||||
|
||||
supervisor:
|
||||
build: .
|
||||
container_name: control-plane-supervisor
|
||||
user: "1000:1000"
|
||||
command: python src/supervisor.py
|
||||
volumes:
|
||||
- /opt/homelab:/opt/homelab
|
||||
- ../..:/repo:ro
|
||||
restart: unless-stopped
|
||||
environment:
|
||||
- REPO_ROOT=/repo
|
||||
- RUNTIME_PATH=/opt/homelab
|
||||
healthcheck:
|
||||
test: ["CMD", "test", "-f", "/opt/homelab/state/supervisor.heartbeat"]
|
||||
interval: 60s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
start_period: 10s
|
||||
|
||||
executor:
|
||||
build: .
|
||||
container_name: control-plane-executor
|
||||
user: "1000:1000"
|
||||
group_add:
|
||||
- "999"
|
||||
command: python src/executor.py
|
||||
volumes:
|
||||
- /opt/homelab:/opt/homelab
|
||||
- ../..:/repo
|
||||
- /var/run/docker.sock:/var/run/docker.sock
|
||||
restart: unless-stopped
|
||||
environment:
|
||||
- REPO_ROOT=/repo
|
||||
- RUNTIME_PATH=/opt/homelab
|
||||
healthcheck:
|
||||
test: ["CMD", "test", "-f", "/opt/homelab/state/executor.heartbeat"]
|
||||
interval: 30s
|
||||
timeout: 5s
|
||||
retries: 3
|
||||
start_period: 5s
|
||||
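Reviewer's sketch: the observer, supervisor, and executor healthchecks above use `test -f`, which only proves the heartbeat file exists. A process that wrote the file once and then hung would still report healthy. The brain-watchdog healthcheck already compares the file's mtime against a limit; the helper below applies the same idea to heartbeat files. The name `heartbeat_is_fresh` and the `max_age_s` parameter are assumptions for illustration, and a suitable limit depends on how often each service touches its heartbeat.

```python
import os
import time
from typing import Optional


def heartbeat_is_fresh(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """True if the file exists and was modified within the last max_age_s seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # Missing or unreadable heartbeat counts as unhealthy.
        return False
    now = time.time() if now is None else now
    return (now - mtime) <= max_age_s
```

A container healthcheck could call this from `python -c` and exit non-zero when it returns `False`, in the same style as the brain-watchdog check.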
19
services/control-plane/pyproject.toml
Normal file
@ -0,0 +1,19 @@
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "control-plane"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pyyaml>=6.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.1",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
246
services/control-plane/src/executor.py
Normal file
|
|
@ -0,0 +1,246 @@
|
|||
import os
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def _atomic_write_json(path: Path, data) -> None:
|
||||
"""Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
|
||||
tmp = path.with_suffix(".tmp")
|
||||
with open(tmp, "w") as f:
|
||||
json.dump(data, f, indent=2)
|
||||
f.flush()
|
||||
os.fsync(f.fileno())
|
||||
os.replace(tmp, path)
|
||||
|
||||
# Constants and Paths
|
||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
||||
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
|
||||
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
|
||||
|
||||
# SSH configuration
|
||||
# SSH_USER can be overridden per-deployment environment.
|
||||
SSH_USER = os.getenv("SSH_USER", "oskar")
|
||||
SSH_OPTIONS = [
|
||||
"-o", "StrictHostKeyChecking=no",
|
||||
"-o", "ConnectTimeout=10",
|
||||
"-o", "BatchMode=yes",
|
||||
]
|
||||
|
||||
# Logging setup
|
||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
||||
logger = logging.getLogger("executor")
|
||||
|
||||
|
||||
class Executor:
|
||||
def __init__(self):
|
||||
self._ensure_dirs()
|
||||
|
||||
def _ensure_dirs(self):
|
||||
for s in ["approved", "running", "completed", "failed", "rejected"]:
|
||||
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
|
||||
|
||||
def process_actions(self):
|
||||
# Update heartbeat
|
||||
heartbeat_file = ACTIONS_DIR.parent / "state" / "executor.heartbeat"
|
||||
try:
|
||||
heartbeat_file.touch()
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||
|
||||
approved_dir = ACTIONS_DIR / "approved"
|
||||
action_files = sorted(approved_dir.glob("*.json"))
|
||||
|
||||
for action_file in action_files:
|
||||
self._execute_action(action_file)
|
||||
|
||||
    def _execute_action(self, action_file):
        action_id = action_file.stem
        logger.info(f"Executing action: {action_id}")

        # Move to running
        running_path = ACTIONS_DIR / "running" / f"{action_id}.json"
        try:
            with open(action_file, "r") as f:
                data = json.load(f)
            data["status"] = "running"
            data["started_at"] = time.time()
            _atomic_write_json(running_path, data)
            action_file.unlink()
        except Exception as e:
            logger.error(f"Failed to move {action_id} to running: {e}")
            return

        # Dispatch by action type
        success = False
        error_msg = ""
        try:
            action_type = data.get("type")
            node = data.get("node")
            service = data.get("service")

            if action_type == "redeploy":
                # Full service redeploy via the repo deploy script
                cmd = [
                    str(REPO_ROOT / "scripts" / "deploy" / "deploy-node.sh"),
                    node,
                    service
                ]
                logger.info(f"Running command: {' '.join(cmd)}")
                result = subprocess.run(cmd, capture_output=True, text=True, cwd=str(REPO_ROOT))
                if result.returncode == 0:
                    success = True
                else:
                    success = False
                    error_msg = result.stderr or result.stdout

            elif action_type == "container_restart":
                # Lightweight restart: SSH to node and docker restart the container.
                # container_name is set by the supervisor; falls back to service name.
                container_name = data.get("container_name") or service
                success, error_msg = self._execute_container_restart(node, container_name)

            elif action_type == "disk_cleanup":
                # Operator-approved aggressive Docker cleanup (image prune -a +
                # volume prune). Commands come from the action payload so the
                # supervisor controls exactly what runs; the executor adds a
                # safety check to reject anything touching protected paths.
                payload = data.get("payload", {})
                success, error_msg = self._execute_disk_cleanup(node, payload)

            elif action_type == "alert_only":
                # Operator acknowledged the alert; no automated execution needed.
                success = True

            else:
                success = False
                error_msg = f"Unknown action type: {action_type}"

        except Exception as e:
            success = False
            error_msg = str(e)

        # Move to completed/failed
        target_status = "completed" if success else "failed"
        target_path = ACTIONS_DIR / target_status / f"{action_id}.json"
        try:
            data["status"] = target_status
            data["finished_at"] = time.time()
            if not success:
                data["error"] = error_msg
            _atomic_write_json(target_path, data)
            running_path.unlink()
            logger.info(f"Action {action_id} {target_status}")
        except Exception as e:
            logger.error(f"Failed to move {action_id} to {target_status}: {e}")

    def _execute_container_restart(self, node, container_name, retry_delay=10):
        """
        SSH to the target node and run `docker restart <container_name>`.

        Attempts the restart up to 2 times (initial + 1 retry). If the first
        attempt fails, waits retry_delay seconds then tries once more before
        declaring the action failed.

        Returns (success: bool, error_msg: str).
        """
        cmd = [
            "ssh",
            *SSH_OPTIONS,
            f"{SSH_USER}@{node}",
            f"docker restart {container_name}",
        ]
        logger.info(f"SSH container restart: {' '.join(cmd)}")

        max_attempts = 2
        last_error = ""

        for attempt in range(1, max_attempts + 1):
            result = subprocess.run(cmd, capture_output=True, text=True)

            if result.returncode == 0:
                logger.info(
                    f"Container '{container_name}' on {node} restarted successfully "
                    f"(attempt {attempt}/{max_attempts})"
                )
                return True, ""

            last_error = (result.stderr or result.stdout).strip()
            logger.warning(
                f"container_restart attempt {attempt}/{max_attempts} failed "
                f"for '{container_name}' on {node}: {last_error}"
            )

            if attempt < max_attempts:
                logger.info(f"Retrying in {retry_delay}s...")
                time.sleep(retry_delay)

        logger.error(
            f"container_restart exhausted all {max_attempts} attempts "
            f"for '{container_name}' on {node}"
        )
        return False, last_error

    def _execute_disk_cleanup(self, node: str, payload: dict):
        """
        SSH to the target node and run the operator-approved disk cleanup
        commands from the action payload.

        Safety invariants enforced here regardless of payload content:
          - No command may reference /opt/homelab/data/, /opt/homelab/config/,
            or /opt/homelab/state/ (application data and configuration).
          - No command may contain rm -rf / or similar destructive patterns.
        If any command fails the safety check the entire action is rejected
        (not run at all) and the rejection reason is recorded.

        Returns (success: bool, error_msg: str).
        """
        commands = payload.get("commands", [
            "docker image prune -a -f",
            "docker volume prune -f",
        ])

        # Safety gate: reject commands that touch protected paths
        FORBIDDEN = [
            "/opt/homelab/data",
            "/opt/homelab/config",
            "/opt/homelab/state",
            "rm -rf /",
        ]
        for cmd in commands:
            for forbidden in FORBIDDEN:
                if forbidden in cmd:
                    msg = f"Rejected: command contains forbidden pattern '{forbidden}': {cmd}"
                    logger.error(msg)
                    return False, msg

        full_command = " && ".join(commands)
        cmd = [
            "ssh",
            *SSH_OPTIONS,
            f"{SSH_USER}@{node}",
            full_command,
        ]
        logger.info(f"Disk cleanup on {node}: {full_command}")

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            logger.info(f"Disk cleanup on {node} succeeded")
            return True, ""

        error_msg = (result.stderr or result.stdout).strip()
        logger.error(f"Disk cleanup on {node} failed: {error_msg}")
        return False, error_msg

    def loop(self, interval=10):
        logger.info("Starting executor loop")
        while True:
            self.process_actions()
            time.sleep(interval)


if __name__ == "__main__":
    executor = Executor()
    executor.loop()
|
||||
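The safety gate in `_execute_disk_cleanup` is a plain substring test, so it helps to see what it accepts and rejects. The following is a minimal standalone sketch, assuming the same four patterns as the `FORBIDDEN` list; the names `FORBIDDEN_PATTERNS` and `first_violation` are hypothetical and are not part of the executor.

```python
from typing import Iterable, Optional, Tuple

# Same substrings as the FORBIDDEN list in _execute_disk_cleanup.
FORBIDDEN_PATTERNS = (
    "/opt/homelab/data",
    "/opt/homelab/config",
    "/opt/homelab/state",
    "rm -rf /",
)


def first_violation(commands: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (command, pattern) for the first forbidden match, else None.

    Hypothetical helper for illustration only; it mirrors the nested loop
    in the executor but does not log or run anything.
    """
    for command in commands:
        for pattern in FORBIDDEN_PATTERNS:
            if pattern in command:
                return command, pattern
    return None


if __name__ == "__main__":
    # The default payload passes the gate.
    print(first_violation(["docker image prune -a -f", "docker volume prune -f"]))
    # Any "rm -rf" on an absolute path is rejected, not only the root directory.
    print(first_violation(["rm -rf /tmp/build-cache"]))
    # Reordered flags are not caught by a plain substring test.
    print(first_violation(["rm -fr /"]))
```

Because the match is purely textual, the gate is best read as a coarse backstop behind operator approval rather than a complete filter: it over-rejects some harmless commands and can miss variants such as reordered flags or extra whitespace.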
701
services/control-plane/src/index.html
Normal file
@@ -0,0 +1,701 @@
<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Operator Control Plane</title>
    <style>
        :root {
            --bg-color: #0a0c0e;
            --sidebar-color: #14171a;
            --card-color: #1c2024;
            --border-color: #2a3540;
            --text-color: #e7edf3;
            --text-muted: #94a3b8;
            --accent-color: #3eaf7c;
            --nominal: #3eaf7c;
            --degraded: #e7c000;
            --unstable: #e67e22;
            --reconciling: #3498db;
            --error: #c0392b;
            --safe: #3eaf7c;
            --guarded: #e67e22;
            --dangerous: #c0392b;
        }

        body {
            margin: 0;
            font-family: 'Inter', system-ui, -apple-system, sans-serif;
            background: var(--bg-color);
            color: var(--text-color);
            display: flex;
            height: 100vh;
            overflow: hidden;
        }

        /* Sidebar */
        .sidebar {
            width: 240px;
            background: var(--sidebar-color);
            border-right: 1px solid var(--border-color);
            display: flex;
            flex-direction: column;
            flex-shrink: 0;
        }

        .sidebar-header {
            padding: 24px;
            font-weight: 800;
            font-size: 14px;
            letter-spacing: 0.1em;
            color: var(--accent-color);
            border-bottom: 1px solid var(--border-color);
        }

        .nav-list {
            list-style: none;
            padding: 12px 0;
            margin: 0;
            flex-grow: 1;
        }

        .nav-item {
            padding: 12px 24px;
            cursor: pointer;
            font-size: 14px;
            color: var(--text-muted);
            transition: all 0.2s;
            display: flex;
            align-items: center;
            gap: 12px;
        }

        .nav-item:hover {
            background: rgba(255, 255, 255, 0.05);
            color: var(--text-color);
        }

        .nav-item.active {
            background: rgba(62, 175, 124, 0.1);
            color: var(--accent-color);
            border-left: 3px solid var(--accent-color);
        }

        .sidebar-footer {
            padding: 16px;
            border-top: 1px solid var(--border-color);
            font-size: 12px;
        }

        /* Content Area */
        .main-content {
            flex-grow: 1;
            display: flex;
            flex-direction: column;
            overflow: hidden;
        }

        header {
            height: 64px;
            border-bottom: 1px solid var(--border-color);
            display: flex;
            align-items: center;
            padding: 0 24px;
            justify-content: space-between;
            background: var(--bg-color);
        }

        .view-title {
            font-size: 18px;
            font-weight: 600;
        }

        .content-scroll {
            flex-grow: 1;
            overflow-y: auto;
            padding: 24px;
        }

        /* Cards & Grids */
        .grid {
            display: grid;
            grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
            gap: 20px;
        }

        .card {
            background: var(--card-color);
            border: 1px solid var(--border-color);
            padding: 20px;
            border-radius: 4px;
            position: relative;
        }

        .card-header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            margin-bottom: 16px;
        }

        .card-title {
            font-weight: 700;
            font-size: 16px;
        }

        /* Status Badges */
        .badge {
            padding: 4px 8px;
            border-radius: 4px;
            font-size: 11px;
            font-weight: 700;
            text-transform: uppercase;
        }

        .status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
        .status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
        .status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
        .status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
        .status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }

        /* Timeline */
        .timeline {
            display: flex;
            flex-direction: column;
            gap: 12px;
        }

        .event {
            padding: 12px;
            border-left: 2px solid var(--border-color);
            background: rgba(255, 255, 255, 0.02);
            font-family: ui-monospace, monospace;
            font-size: 13px;
        }

        .event.high { border-left-color: var(--error); }
        .event.medium { border-left-color: var(--unstable); }
        .event.low { border-left-color: var(--nominal); }

        .event-header {
            display: flex;
            justify-content: space-between;
            margin-bottom: 4px;
            color: var(--text-muted);
        }

        /* Forms & Inputs */
        .controls {
            display: flex;
            gap: 12px;
            margin-top: 20px;
        }

        input, button {
            background: var(--card-color);
            border: 1px solid var(--border-color);
            color: var(--text-color);
            padding: 8px 16px;
            font-size: 14px;
            border-radius: 4px;
        }

        button {
            cursor: pointer;
            font-weight: 600;
        }

        button:hover { background: var(--border-color); }

        .btn-primary { background: var(--accent-color); color: white; border: none; }
        .btn-primary:hover { background: #359b6d; }

        /* Utility */
        .hidden { display: none !important; }
        .mono { font-family: ui-monospace, monospace; }
        .label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
        .value { font-weight: 500; margin-bottom: 12px; }

        .risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
        .risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
        .risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }

    </style>
</head>
<body>
    <aside class="sidebar">
        <div class="sidebar-header">HOMELAB OPERATOR</div>
        <ul class="nav-list">
            <li class="nav-item active" onclick="showView('dashboard', this)">
                <span>Dashboard</span>
            </li>
            <li class="nav-item" onclick="showView('actions', this)">
                <span>Action Queue</span>
            </li>
            <li class="nav-item" onclick="showView('nodes', this)">
                <span>Nodes</span>
            </li>
            <li class="nav-item" onclick="showView('services', this)">
                <span>Services</span>
            </li>
            <li class="nav-item" onclick="showView('deployments', this)">
                <span>Deployments</span>
            </li>
            <li class="nav-item" onclick="showView('topology', this)">
                <span>Topology</span>
            </li>
            <li class="nav-item" onclick="showView('events', this)">
                <span>Events</span>
            </li>
            <li class="nav-item" onclick="showView('correlation', this)">
                <span>Correlation</span>
            </li>
            <li class="nav-item" onclick="showView('recommendations', this)">
                <span>Recommendations</span>
            </li>
            <li class="nav-item" onclick="showView('settings', this)">
                <span>Settings</span>
            </li>
        </ul>
        <div class="sidebar-footer">
            <div id="summary-status">System Status: Loading...</div>
        </div>
    </aside>

    <main class="main-content">
        <div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
            RUNTIME STATE IS STALE
        </div>
        <header>
            <div style="display:flex; align-items:center; gap:20px">
                <div class="view-title" id="current-view-title">Dashboard</div>
                <select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
                    <option value="observe">OBSERVE</option>
                    <option value="recommend">RECOMMEND</option>
                    <option value="approval" selected>APPROVAL</option>
                    <option value="autonomous">AUTONOMOUS</option>
                    <option value="maintenance">MAINTENANCE</option>
                </select>
            </div>
            <div class="header-actions">
                <button onclick="refreshData()">Refresh</button>
            </div>
        </header>

        <div class="content-scroll">
            <!-- Dashboard View -->
            <div id="view-dashboard" class="view">
                <div class="grid">
                    <div class="card">
                        <div class="card-title">System Overview</div>
                        <div id="dashboard-summary" style="margin-top:20px"></div>
                    </div>
                    <div class="card">
                        <div class="card-title">Pending Actions</div>
                        <div id="dashboard-actions-summary" style="margin-top:20px"></div>
                    </div>
                    <div class="card">
                        <div class="card-title">Active Incidents</div>
                        <div id="dashboard-incidents" style="margin-top:20px"></div>
                    </div>
                </div>
            </div>

            <!-- Actions View -->
            <div id="view-actions" class="view hidden">
                <div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
                    <div>
                        <h3>Pending Approval</h3>
                        <div id="actions-pending" class="timeline"></div>
                    </div>
                    <div>
                        <h3>Active / History</h3>
                        <div id="actions-history" class="timeline"></div>
                    </div>
                </div>
            </div>

            <!-- Nodes View -->
            <div id="view-nodes" class="view hidden">
                <div class="grid" id="nodes-list"></div>
            </div>

            <!-- Services View -->
            <div id="view-services" class="view hidden">
                <div class="grid" id="services-list"></div>
            </div>

            <!-- Deployments View -->
            <div id="view-deployments" class="view hidden">
                <div class="grid" id="deployments-list"></div>
            </div>

            <!-- Topology View -->
            <div id="view-topology" class="view hidden">
                <div class="card" style="min-height:500px">
                    <div class="card-title">Runtime Topology</div>
                    <div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
                </div>
            </div>

            <!-- Events View -->
            <div id="view-events" class="view hidden">
                <div class="timeline" id="events-timeline"></div>
            </div>

            <!-- Correlation View -->
            <div id="view-correlation" class="view hidden">
                <div id="correlation-chains" class="grid"></div>
            </div>

            <!-- Recommendations View -->
            <div id="view-recommendations" class="view hidden">
                <div class="grid" id="recommendations-list"></div>
            </div>

            <!-- Settings View -->
            <div id="view-settings" class="view hidden">
                <div class="card">
                    <div class="card-title">Configuration</div>
                    <div id="settings-content" style="margin-top:20px"></div>
                </div>
            </div>
        </div>
    </main>

    <script>
        let currentView = 'dashboard';
        const pollInterval = 5000;

        function showView(viewId, el) {
            document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
            document.getElementById('view-' + viewId).classList.remove('hidden');
            document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
            if (el) el.classList.add('active');
            currentView = viewId;
            document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
            refreshData();
        }

        async function fetchData(endpoint) {
            try {
                const res = await fetch(endpoint, {cache: 'no-store'});
                // Treat HTTP error responses as failures so callers receive null
                if (!res.ok) throw new Error(`HTTP ${res.status}`);
                return await res.json();
            } catch (e) {
                console.error('Fetch error:', endpoint, e);
                return null;
            }
        }

        async function postData(endpoint, data) {
            try {
                const res = await fetch(endpoint, {
                    method: 'POST',
                    headers: {'Content-Type': 'application/json'},
                    body: JSON.stringify(data)
                });
                return await res.json();
            } catch (e) {
                console.error('Post error:', endpoint, e);
                return null;
            }
        }

        async function mutateAction(id, status) {
            const res = await postData('/action/mutate', {id, status});
            if (res && res.status === 'ok') {
                refreshData();
            } else {
                alert('Mutation failed');
            }
        }

        async function setOperatorMode(mode) {
            console.log('Operator mode set to:', mode);
            const res = await postData('/mode', {mode});
            if (res && res.status === 'ok') {
                console.log('Mode updated successfully');
            }
        }

        function formatTime(ts) {
            if (!ts) return 'N/A';
            return new Date(ts * 1000).toLocaleString();
        }

        function getStatusClass(status) {
            status = (status || '').toLowerCase();
            if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
            if (['degraded', 'warning'].includes(status)) return 'status-degraded';
            if (['unstable'].includes(status)) return 'status-unstable';
            if (['reconciling'].includes(status)) return 'status-reconciling';
            if (['error', 'down', 'failed'].includes(status)) return 'status-error';
            return '';
        }

        async function refreshData() {
            // Refresh summary always
            const summary = await fetchData('/summary');
            if (summary) {
                const statusEl = document.getElementById('summary-status');
                // Guard against a summary payload without a status field
                statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
                statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);

                // Handle stale state
                const staleBanner = document.getElementById('stale-banner');
                if (summary.stale) {
                    staleBanner.classList.remove('hidden');
                    staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
                } else {
                    staleBanner.classList.add('hidden');
                }

                if (currentView === 'dashboard') {
                    const dashSummary = document.getElementById('dashboard-summary');
                    dashSummary.innerHTML = `
                        <div class="label">Nodes</div><div class="value">${summary.node_count}</div>
                        <div class="label">Services</div><div class="value">${summary.service_count}</div>
                        <div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
                    `;
                }
            }

            if (currentView === 'dashboard' || currentView === 'actions') {
                const actions = await fetchData('/actions');
                if (actions) {
                    if (currentView === 'dashboard') {
                        const dashActions = document.getElementById('dashboard-actions-summary');
                        const pendingCount = actions.pending.length;
                        dashActions.innerHTML = `
                            <div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
                            <div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
                        `;
                    }
                    if (currentView === 'actions') {
                        const pendingEl = document.getElementById('actions-pending');
                        const historyEl = document.getElementById('actions-history');

                        pendingEl.innerHTML = actions.pending.map(a => `
                            <div class="card" style="margin-bottom:12px">
                                <div class="card-header">
                                    <div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
                                    <span class="badge risk-${a.risk_level}">${a.risk_level}</span>
                                </div>
                                <p>${a.description || a.action_type || 'No description'}</p>
                                <div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
                                <div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
                                <div class="controls">
                                    <button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
                                    <button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
                                </div>
                            </div>
                        `).join('') || 'No pending actions.';

                        const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
                        historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
                            <div class="event">
                                <div class="event-header">
                                    <span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
                                    <span class="badge ${getStatusClass(a.status)}">${a.status}</span>
                                </div>
                                <div>${a.description || a.action_type || 'No description'}</div>
                                <small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
                                ${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
                                ${a.transition_history ? `
                                    <div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
                                        <strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
                                    </div>
                                ` : ''}
                            </div>
                        `).join('') || 'No history.';
                    }
                }
            }

            if (currentView === 'dashboard' || currentView === 'events') {
                const incidents = await fetchData('/incidents');
                if (currentView === 'dashboard') {
                    const dashIncidents = document.getElementById('dashboard-incidents');
                    if (!incidents || incidents.length === 0) {
                        dashIncidents.textContent = 'No active incidents.';
                    } else {
                        dashIncidents.innerHTML = incidents.map(inc => `
                            <div class="event ${inc.severity}">
                                <strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
                                <small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
                            </div>
                        `).join('');
                    }
                }
            }

            if (currentView === 'nodes') {
                const nodes = await fetchData('/nodes');
                const list = document.getElementById('nodes-list');
                // fetchData returns null on failure, so fall back to an empty list
                list.innerHTML = (nodes || []).map(node => `
                    <div class="card">
                        <div class="card-header">
                            <div class="card-title">${node.hostname}</div>
                            <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
                        </div>
                        <div class="label">ID</div><div class="value mono">${node.id}</div>
                        <div class="label">Capabilities</div><div class="value">${(node.capabilities || []).join(', ')}</div>
                        <div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
                        <div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
                        <div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
                        <div class="label">Runtime Status</div><div class="value">${node.status}</div>
                    </div>
                `).join('');
            }

            if (currentView === 'services') {
                const services = await fetchData('/services');
                const list = document.getElementById('services-list');
                // fetchData returns null on failure, so fall back to an empty list
                list.innerHTML = (services || []).map(svc => `
                    <div class="card">
                        <div class="card-header">
                            <div class="card-title">${svc.name}</div>
                            <span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
                        </div>
                        <div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
                        <div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
                        <div class="label">Dependencies</div><div class="value">${(svc.dependencies || []).join(', ') || 'None'}</div>
                        <div class="label">Recommendations</div><div class="value">${(svc.recommendations || []).join(', ') || 'None'}</div>
                    </div>
                `).join('');
            }

            if (currentView === 'deployments') {
                const deps = await fetchData('/deployments');
                const list = document.getElementById('deployments-list');
                // fetchData returns null on failure, so fall back to an empty list
                list.innerHTML = (deps || []).map(dep => `
                    <div class="card">
                        <div class="card-header">
                            <div class="card-title">${dep.service}</div>
                            <span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
                        </div>
                        <div class="label">ID</div><div class="value mono">${dep.id}</div>
                        <div class="label">Stage</div><div class="value">${dep.stage}</div>
                        <div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
                        <div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
                        ${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
                    </div>
                `).join('');
            }

            if (currentView === 'events') {
                const events = await fetchData('/events');
                const timeline = document.getElementById('events-timeline');
                // fetchData returns null on failure, so fall back to an empty list
                timeline.innerHTML = (events || []).map(ev => `
                    <div class="event ${ev.severity}">
                        <div class="event-header">
                            <span>${ev.type.toUpperCase()}</span>
                            <span>${formatTime(ev.timestamp)}</span>
                        </div>
                        <div>${ev.message}</div>
                        <div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
                    </div>
                `).join('');
            }

            if (currentView === 'recommendations') {
                const recs = await fetchData('/recommendations');
                const list = document.getElementById('recommendations-list');
                // fetchData returns null on failure, so fall back to an empty list
                list.innerHTML = (recs || []).map(rec => `
                    <div class="card">
                        <div class="card-header">
                            <div class="card-title">${rec.title}</div>
                            <span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
                        </div>
                        <p>${rec.description}</p>
                        <div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
                        <div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
                        <div class="label">Blocked Actions</div><div class="value">${(rec.blocked_actions || []).join(', ') || 'None'}</div>
                        <div class="controls">
                            <button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
                        </div>
                    </div>
                `).join('');
            }

            if (currentView === 'topology') {
                const nodes = await fetchData('/nodes');
                const services = await fetchData('/services');
                const topMap = document.getElementById('topology-map');
                if (nodes && services) {
                    topMap.innerHTML = nodes.map(node => {
                        const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
                        return `
                            <div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
                                <div class="card-header">
                                    <div class="card-title">${node.hostname}</div>
                                    <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
                                </div>
                                <div class="label">Capabilities</div>
                                <div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
                                <div class="label">Services</div>
                                <div style="font-size:12px; margin-bottom:10px">
                                    ${nodeServices.length > 0 ? nodeServices.map(s => `
                                        <div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
                                            <span>${s.name}</span>
                                            <span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
                                        </div>
                                        ${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
                                    `).join('') : '<div class="value">None</div>'}
                                </div>
                            </div>
                        `;
                    }).join('');
                }
            }

            if (currentView === 'correlation') {
                const incidents = await fetchData('/incidents');
                const actions = await fetchData('/actions');
                const list = document.getElementById('correlation-chains');
                if (incidents && actions) {
                    const allActions = Object.values(actions).flat();
                    list.innerHTML = incidents.map(inc => {
                        const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
                        return `
                            <div class="card">
                                <div class="card-header">
                                    <div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
                                    <span class="badge status-error">Active</span>
                                </div>
                                <p>${inc.message}</p>
                                <div class="label">Related Actions</div>
                                ${related.map(a => `
                                    <div class="event" style="margin-top:5px">
                                        <strong>${a.type}</strong> (${a.status})<br>
                                        <small>${a.description}</small>
                                    </div>
                                `).join('') || '<div class="value">No actions yet</div>'}
                            </div>
                        `;
                    }).join('');
                }
            }
            if (currentView === 'settings') {
                // fetchData returns null on failure, so fall back to an empty config
                const config = (await fetchData('/config')) || {};
                const content = document.getElementById('settings-content');
                content.innerHTML = `
                    <div class="label">Auto Mode</div>
                    <div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
                    <div class="label">Action Thresholds</div>
                    <div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
                    <div class="label">Telegram Integration</div>
                    <div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
                    <button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
                `;
            }
        }

        // Initial load
        refreshData();
        // Poll for updates
        setInterval(refreshData, pollInterval);

    </script>
</body>
</html>
426
services/control-plane/src/operator_ui.py
Normal file
@@ -0,0 +1,426 @@
import heapq
import json
import os
import re
import time
from datetime import datetime
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))

STATIC_DIR = Path(__file__).parent

_EVENT_TS_RE = re.compile(r"-(\d{9,11})-")

DEFAULT_CONFIG = {
    "operator_mode": "approval",
    "auto_mode": True,
    "action_thresholds": {
        "restart_ha": 0.8,
        "check_network": 0.9,
    },
    "default_threshold": 0.9,
    "allowed_auto_actions": ["restart_ha"],
}


def read_json_file(path, default=None):
    if not path.exists():
        return default if default is not None else []
    try:
        return json.loads(path.read_text())
    except Exception:
        return default if default is not None else []


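def get_config():
    config_path = STATE_DIR / "operator-config.json"
    if config_path.exists():
        config = read_json_file(config_path, DEFAULT_CONFIG)
    else:
        config = DEFAULT_CONFIG
    # Return a copy so that callers which mutate the result (POST /config,
    # POST /mode) never modify the module-level DEFAULT_CONFIG.
    return dict(config) if isinstance(config, dict) else config

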
def save_config(config):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))


EVENTS_MAX_AGE_HOURS = int(os.getenv("EVENTS_MAX_AGE_HOURS", "24"))
EVENTS_MAX_COUNT = int(os.getenv("EVENTS_MAX_COUNT", "200"))


def _node_health(info):
    status = info.get("status", "unknown")
    if status == "offline":
        return "error"
    if info.get("disk_pressure") == "high":
        return "degraded"
    if status == "online":
        return "nominal"
    return status


def current_nodes():
    """Return nodes as a list of dicts shaped for the UI.

    The observer stores nodes as a keyed dict {node_name: {...}}. The frontend
    calls .map() which requires an array, so we convert here rather than change
    the on-disk format (which the supervisor also reads).
    """
    raw = read_json_file(WORLD_DIR / "nodes.json", default={})
    if isinstance(raw, list):
        return raw
    result = []
    for name, info in raw.items():
        result.append({
            "id": name,
            "hostname": name,
            "health": _node_health(info),
            "status": info.get("status", "unknown"),
            "capabilities": info.get("roles", []),
            "connectivity": "tailscale",
            "incidents": 0,
            "last_seen": info.get("last_seen"),
            "disk_usage_pct": info.get("disk_usage_pct"),
            "mem_usage_pct": info.get("mem_usage_pct"),
            "cpu_usage_pct": info.get("cpu_usage_pct"),
            "disk_pressure": info.get("disk_pressure"),
        })
    return result


def current_services():
    """Return services as a list of dicts shaped for the UI.

    Observer stores services as {"node/service": {...}}. Converted to a list
    with the fields the services and topology views expect.
    """
    raw = read_json_file(WORLD_DIR / "services.json", default={})
    if isinstance(raw, list):
        return raw
    result = []
    for key, info in raw.items():
        svc_status = info.get("status", "unknown")
        result.append({
            "id": key,
            "name": info.get("service", key),
            "node": info.get("node", ""),
            "health": ("nominal" if svc_status == "healthy"
                       else ("error" if svc_status == "unhealthy"
                             else svc_status)),
            "desired_state": "running",
            "actual_state": svc_status,
            "deployment_state": "deployed",
            "dependencies": [],
            "recommendations": [],
            "last_check": info.get("last_check"),
            "incident_id": info.get("incident_id"),
        })
    return result


def current_deployments():
    """Return deployments as a list sorted newest-first."""
    raw = read_json_file(WORLD_DIR / "deployments.json", default={})
    if isinstance(raw, list):
        return raw
    result = []
    for dep_id, info in raw.items():
        result.append({
            "id": dep_id,
            "service": info.get("service", ""),
            "node": info.get("node", ""),
            "status": info.get("status", "unknown"),
            "stage": info.get("status", "unknown"),
            "diagnostics": info.get("last_error", ""),
            "resumable": info.get("status") == "failed",
            "started_at": info.get("started_at"),
            "finished_at": info.get("finished_at"),
        })
    return sorted(result, key=lambda x: x.get("started_at") or 0, reverse=True)


def current_incidents():
    """Return active incidents as a list sorted most-recent-first.

    Only incidents with status='active' are returned; resolved and cancelled
    records are excluded so the dashboard reflects the current operational state.
    """
    raw = read_json_file(WORLD_DIR / "incidents.json", default={})
    if isinstance(raw, list):
        return [i for i in raw if i.get("status") == "active"]
    result = []
    for inc in raw.values():
        if inc.get("status") != "active":
            continue
        # Synthesise a human-readable message if not stored (observer doesn't set one).
        if "message" not in inc:
            inc = dict(inc)
            inc["message"] = (
                f"{inc.get('service', '?')} on {inc.get('node', '?')} "
                f"is {inc.get('trigger_type', 'unhealthy')}"
            )
        result.append(inc)
    return sorted(result, key=lambda x: x.get("last_occurrence") or 0, reverse=True)


def current_recommendations():
    return read_json_file(WORLD_DIR / "recommendations.json")


def current_summary():
    path = WORLD_DIR / "runtime-summary.json"
    summary = read_json_file(path, default={})
    if summary:
        last_update_val = summary.get("last_update")
        if last_update_val:
            try:
                if isinstance(last_update_val, str):
                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
                else:
                    last_update = float(last_update_val)
            except Exception:
                last_update = os.path.getmtime(path)
        else:
            last_update = os.path.getmtime(path)
        summary["last_update"] = last_update
        summary["stale"] = (time.time() - last_update) > 60
    return summary


def _event_file_ts(p: Path) -> int:
    """Extract epoch timestamp from event filename: evt-<node>-<ts>-<type>-<svc>.json"""
    m = _EVENT_TS_RE.search(p.stem)
    return int(m.group(1)) if m else 0


def current_events():
    """Return up to EVENTS_MAX_COUNT of the most recent events, sorted newest-first.

    Event files are named evt-<node>-<epoch>-<type>-<svc>.json. The directory
    can contain hundreds of thousands of files (one file per event, written by
    node-agent). Loading every file on each request causes catastrophic RSS
    growth: 242 k files ≈ 420 MB of Python objects + 100 MB JSON serialisation.

    Fix: use heapq.nlargest to stream through file paths
    (O(N_files * log EVENTS_MAX_COUNT) time, O(EVENTS_MAX_COUNT) memory),
    extracting the epoch from the filename without opening any file. Only the
    winning EVENTS_MAX_COUNT files are then read, and any of those older than
    EVENTS_MAX_AGE_HOURS are dropped.
    """
    if not EVENTS_DIR.exists():
        return []

    cutoff = time.time() - EVENTS_MAX_AGE_HOURS * 3600

    # Stream all paths through a bounded heap of EVENTS_MAX_COUNT entries;
    # the full list of paths is never materialised.
    candidates = heapq.nlargest(
        EVENTS_MAX_COUNT,
        EVENTS_DIR.glob("**/*.json"),
        key=_event_file_ts,
    )

    events = []
    for f in candidates:
        data = read_json_file(f)
        if data and (data.get("timestamp") or 0) > cutoff:
            data["_source"] = f.name
            events.append(data)

    return sorted(events, key=lambda x: x.get("timestamp") or 0, reverse=True)


def current_actions():
    actions = {}
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    for status in statuses:
        actions[status] = []
        status_dir = ACTIONS_DIR / status
        if status_dir.exists():
            for f in status_dir.glob("*.json"):
                data = read_json_file(f)
                if data:
                    # Injects some metadata for UI
                    data["id"] = data.get("action_id") or f.stem
                    data["status"] = status
                    actions[status].append(data)
    return actions


def mutate_action(action_id, target_status):
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"

    # Find where the action is
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break

    if not source_path:
        return False, f"Action {action_id} not found"

    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"

    try:
        data = json.loads(source_path.read_text())
        data["status"] = target_status
        data["updated_at"] = time.time()

        # Keep history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": time.time()
        })
        data["transition_history"] = history

        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)


def send_json(status, payload, handler):
    body = (json.dumps(payload) + "\n").encode("utf-8")
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/config":
            send_json(200, get_config(), self)
            return

        if self.path == "/nodes":
            send_json(200, current_nodes(), self)
            return

        if self.path == "/services":
            send_json(200, current_services(), self)
            return

        if self.path == "/deployments":
            send_json(200, current_deployments(), self)
            return

        if self.path == "/incidents":
            send_json(200, current_incidents(), self)
            return

        if self.path == "/recommendations":
            send_json(200, current_recommendations(), self)
            return

        if self.path == "/summary":
            send_json(200, current_summary(), self)
            return

        if self.path == "/events":
            send_json(200, current_events(), self)
            return

        if self.path == "/actions":
            send_json(200, current_actions(), self)
            return

        if self.path in ("/", "/index.html"):
            body = (STATIC_DIR / "index.html").read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return

        self.send_error(404)

    def do_POST(self):
        if self.path not in (
            "/config",
            "/action/mutate",
            "/mode",
        ):
            self.send_error(404)
            return

        length = int(self.headers.get("Content-Length", "0"))
        raw_body = self.rfile.read(length).decode("utf-8")
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError:
            self.send_error(400, "Invalid JSON")
            return

        if self.path == "/config":
            config = get_config()
            config.update(payload)
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/mode":
            mode = payload.get("mode")
            if not mode:
                self.send_error(400, "mode is required")
                return
            config = get_config()
            config["operator_mode"] = mode
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/action/mutate":
            action_id = payload.get("id")
            target = payload.get("status")
            if not action_id or not target:
                self.send_error(400, "id and status are required")
                return
            success, msg = mutate_action(action_id, target)
            if success:
                send_json(200, {"status": "ok"}, self)
            else:
                self.send_error(500, msg)
            return

    def log_message(self, format, *args):
        return


class OperatorHTTPServer(ThreadingHTTPServer):
    # Use daemon threads so finished request threads do not accumulate in the
    # internal _threads list. ThreadingMixIn only tracks non-daemon threads
    # (for joining at server_close); with daemon_threads=True that list stays
    # empty, preventing unbounded growth of dead Thread objects over time.
    daemon_threads = True


if __name__ == "__main__":
    # Ensure directories exist
    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
        d.mkdir(parents=True, exist_ok=True)
    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)

    port = int(os.getenv("PORT", "8080"))
    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
    server = OperatorHTTPServer(("0.0.0.0", port), Handler)
    server.serve_forever()
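Reviewer note: the bounded selection that `current_events()` relies on can be tried in isolation. The following is a minimal sketch and not part of this commit. It assumes the same `evt-<node>-<epoch>-<type>-<svc>.json` naming; `file_ts`, `newest_event_names`, and the sample filenames are hypothetical names introduced only for illustration.

```python
import heapq
import re

# Same pattern as _EVENT_TS_RE in operator_ui.py: a 9-11 digit epoch between dashes.
_TS_RE = re.compile(r"-(\d{9,11})-")


def file_ts(name: str) -> int:
    """Return the epoch embedded in an event filename, or 0 if there is none."""
    m = _TS_RE.search(name)
    return int(m.group(1)) if m else 0


def newest_event_names(names, limit):
    """Pick the `limit` newest names without sorting or storing the whole input.

    heapq.nlargest accepts any iterable (a generator works) and keeps at most
    `limit` candidates in memory while it consumes the input.
    """
    return heapq.nlargest(limit, names, key=file_ts)


if __name__ == "__main__":
    # Hypothetical filenames, generated lazily to mimic a directory scan.
    sample = (
        f"evt-node1-{1700000000 + i}-service_healthy-ha.json" for i in range(1000)
    )
    for name in newest_event_names(sample, 3):
        print(name)
```

Because the timestamp comes from the filename, no file is opened during selection; only the selected entries would need to be read afterwards, which is the property the docstring of `current_events()` describes.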
771
services/control-plane/src/supervisor.py
Normal file
@ -0,0 +1,771 @@
import os
import json
import time
import logging
import yaml
from pathlib import Path


def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
WORLD_DIR = Path(RUNTIME_PATH) / "world"
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))

# Node alias map: maps alternative node names (as they appear in events/world state)
# to canonical topology node names (as they appear in hosts/*/services.yaml and topology.yaml).
# Override at runtime via NODE_ALIAS_MAP env var as a JSON string, e.g.:
# NODE_ALIAS_MAP='{"node-2": "chelsty", "node-1": "piha"}'
_NODE_ALIAS_ENV = os.getenv("NODE_ALIAS_MAP", "{}")
try:
    NODE_ALIAS_MAP = json.loads(_NODE_ALIAS_ENV)
except Exception:
    NODE_ALIAS_MAP = {}

# Event trigger types that should result in a lightweight container_restart
# rather than a full redeploy. The container is present but not running,
# or a dependency (MQTT) is unreachable — a restart is the right first step.
CONTAINER_RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}

# Nodes where automatic disk_cleanup actions must NOT be generated.
# On chelsty nodes disk fullness is overwhelmingly caused by Frigate recordings
# or the HA database — Docker cleanup will not help and the operator must
# decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}

# ---------------------------------------------------------------------------
# HA diagnostic event routing (ha-diag-agent events)
# ---------------------------------------------------------------------------

# ha_websocket_dead: HA WebSocket unresponsive → restart the homeassistant container.
# Separate from CONTAINER_RESTART_TRIGGERS because these events are routed directly
# from the events dir (not via the world-state drift loop) to avoid conflicts with
# the stability-agent's independent container health tracking on the same service key.
HA_CONTAINER_RESTART_EVENTS = {"ha_websocket_dead"}

# Alert-only events — operator notification, no automated action.
HA_ALERT_ONLY_EVENTS = {
    "ha_integration_failed",
    "ha_entity_unavailable_long",
    "ha_automation_failing",
    "ha_update_available",
    "ha_recorder_lag",
    "ha_system_health_degraded",
}

# Stable action-ID suffix for each alert-only type
_HA_ALERT_ID_SUFFIX = {
    "ha_integration_failed": "integration-failed",
    "ha_entity_unavailable_long": "entity-unavailable",
    "ha_automation_failing": "automation-failing",
    "ha_update_available": "update-available",
    "ha_recorder_lag": "recorder-lag",
    "ha_system_health_degraded": "system-health-degraded",
}

# 30-min cooldown after a container_restart completes; prevents restart loops
# when HA repeatedly fails to connect (e.g. bad config, slow startup).
HA_WEBSOCKET_RESTART_COOLDOWN = 1800

# 1-hour cooldown for alert-only events; avoids repeated Telegram noise for
# persistent conditions (e.g. an entity that stays unavailable for hours).
HA_ALERT_COOLDOWN = 3600

# Suppress ha_* events if homeassistant had a containers_not_running incident
# within this window — HA is in a planned restart/update and alerts would be noise.
HA_TRANSITION_WINDOW = 300  # 5 minutes

# When True, events that would generate container_restart are downgraded to alert_only
# with a "[SHADOW MODE]" note. Safe default for initial deployment; set
# HA_DIAG_SHADOW_MODE=false on the control-plane node when ready for live actions.
HA_DIAG_SHADOW_MODE = os.getenv("HA_DIAG_SHADOW_MODE", "true").lower() == "true"

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("supervisor")


class Supervisor:
    def __init__(self):
        self.desired_state = {"services": {}}
        self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
        # In-memory set of already-routed HA event IDs; prevents re-processing
        # on each reconcile cycle. Grows to at most ~hundreds of entries/day.
        self._ha_processed_event_ids: set = set()
        self._ensure_dirs()
        logger.info(
            "shadow_mode=%s — HA container_restart actions %s",
            HA_DIAG_SHADOW_MODE,
            "downgraded to alert_only" if HA_DIAG_SHADOW_MODE else "enabled",
        )

    def _ensure_dirs(self):
        ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
        (ACTIONS_DIR / "pending").mkdir(parents=True, exist_ok=True)

    # ------------------------------------------------------------------
    # Node name resolution
    # ------------------------------------------------------------------

    def _resolve_node(self, name):
        """Resolve an event/world-state node name to its canonical topology name."""
        return NODE_ALIAS_MAP.get(name, name)

    # ------------------------------------------------------------------
    # Container name lookup
    # ------------------------------------------------------------------

    def _get_container_name(self, service):
        """
        Determine the Docker container name for a service.
        Parses container_name from the service's docker-compose.yml.
        Falls back to the service name if not found.
        """
        compose_path = REPO_ROOT / "services" / service / "docker-compose.yml"
        if compose_path.exists():
            try:
                with open(compose_path, "r") as f:
                    compose = yaml.safe_load(f)
                for svc_block in compose.get("services", {}).values():
                    cname = svc_block.get("container_name")
                    if cname:
                        return cname
            except Exception as e:
                logger.warning(f"Could not parse docker-compose for {service}: {e}")
        # Convention: container name matches service name
        return service

    # ------------------------------------------------------------------
    # State loading
    # ------------------------------------------------------------------

    def _load_desired_state(self):
        services = {}
        hosts_dir = REPO_ROOT / "hosts"
        if not hosts_dir.exists():
            logger.warning(f"Hosts directory {hosts_dir} does not exist")
            return

        for host_dir in hosts_dir.iterdir():
            if host_dir.is_dir():
                svc_file = host_dir / "services.yaml"
                if svc_file.exists():
                    try:
                        with open(svc_file, "r") as f:
                            data = yaml.safe_load(f)
                        host_name = data.get("host")
                        for svc_name, svc_info in data.get("services", {}).items():
                            svc_info = svc_info or {}
                            # monitor: false — service is documented as desired but
                            # intentionally excluded from supervisor action generation.
                            # Use this when a service is not yet bootstrapped on an
                            # offline/LTE node so the queue stays clean until it is.
                            if svc_info.get("monitor") is False:
                                logger.debug(
                                    f"Skipping {host_name}/{svc_name}: monitor=false"
                                )
                                continue
                            svc_key = f"{host_name}/{svc_name}"
                            services[svc_key] = {
                                "node": host_name,
                                "service": svc_name,
                                "desired": "running"
                            }
                    except Exception as e:
                        logger.error(f"Failed to load {svc_file}: {e}")
        self.desired_state["services"] = services

    def _load_actual_state(self) -> bool:
        """Load world state from disk. Returns False if any file is unreadable
        (empty / mid-write truncation), in which case actual_state is NOT updated
        so the caller can skip this reconcile cycle rather than treating missing
        data as a real drift signal."""
        files = {
            "services": WORLD_DIR / "services.json",
            "nodes": WORLD_DIR / "nodes.json",
            "incidents": WORLD_DIR / "incidents.json"
        }
        raw = {}
        for key, path in files.items():
            if path.exists():
                try:
                    with open(path, "r") as f:
                        raw[key] = json.load(f)
                except Exception as e:
                    logger.warning(
                        f"World state {path.name} unreadable (truncated write?): {e} "
                        f"— skipping reconcile cycle, keeping last known state"
                    )
                    return False
            else:
                raw[key] = {}

        # Normalize node names in services using alias map so that
        # event-sourced names (e.g. "node-2") resolve to canonical
        # topology names (e.g. "chelsty") before comparison with desired state.
        normalized_services = {}
        for svc_key, svc_info in raw.get("services", {}).items():
            svc_info = dict(svc_info)
            raw_node = svc_info.get("node", "")
            canonical_node = self._resolve_node(raw_node)
            if canonical_node != raw_node:
                logger.debug(f"Resolved node alias: {raw_node} → {canonical_node}")
                svc_info["node"] = canonical_node
                svc_name = svc_info.get("service") or svc_key.split("/", 1)[-1]
                svc_key = f"{canonical_node}/{svc_name}"
            normalized_services[svc_key] = svc_info

        # Normalize node names in incidents as well
        normalized_incidents = {}
        for inc_id, inc in raw.get("incidents", {}).items():
            inc = dict(inc)
            raw_node = inc.get("node", "")
            inc["node"] = self._resolve_node(raw_node)
            normalized_incidents[inc_id] = inc

        self.actual_state["services"] = normalized_services
        self.actual_state["nodes"] = raw.get("nodes", {})
        self.actual_state["incidents"] = normalized_incidents
        return True

    # ------------------------------------------------------------------
    # Incident helpers
    # ------------------------------------------------------------------

    def _get_incident_trigger(self, svc_key):
        """
        Return the trigger_type of the active incident for a service, or None.
        trigger_type is set by the observer when it creates an incident from
        a specific event type (e.g. 'containers_not_running', 'mqtt_unreachable').
        """
        svc_info = self.actual_state["services"].get(svc_key, {})
        incident_id = svc_info.get("incident_id")
        if not incident_id:
            return None
        incident = self.actual_state["incidents"].get(incident_id, {})
        if incident.get("status") == "active":
            return incident.get("trigger_type")
        return None

    # ------------------------------------------------------------------
    # Reconciliation loop
    # ------------------------------------------------------------------

    def reconcile(self):
        # Update heartbeat
        heartbeat_file = WORLD_DIR.parent / "state" / "supervisor.heartbeat"
        try:
            heartbeat_file.touch()
        except Exception as e:
            logger.error(f"Failed to touch heartbeat file: {e}")

        self._load_desired_state()
        if not self._load_actual_state():
            return  # world state unreadable this cycle — skip to avoid false drift

        drifts = []

        # 1. Check for missing or unhealthy services
        for svc_key, desired_info in self.desired_state["services"].items():
            actual_info = self.actual_state["services"].get(svc_key)

            if not actual_info:
                drifts.append({
                    "type": "missing_service",
                    "svc_key": svc_key,
                    "node": desired_info["node"],
                    "service": desired_info["service"],
                    "trigger_type": None,
                })
            elif actual_info.get("status") != "healthy":
                trigger_type = self._get_incident_trigger(svc_key)
                drifts.append({
                    "type": "unhealthy_service",
                    "svc_key": svc_key,
                    "node": desired_info["node"],
                    "service": desired_info["service"],
                    "status": actual_info.get("status"),
                    "trigger_type": trigger_type,
                })

        # 2. Generate service-level recommendations
        for drift in drifts:
            self._generate_recommendation(drift)

        # 3. Generate node-level recommendations (disk pressure)
        for node_name, node_info in self.actual_state["nodes"].items():
            if node_name in NO_DISK_CLEANUP_NODES:
                continue
            if node_info.get("disk_pressure") == "high":
                self._generate_disk_cleanup_recommendation(node_name)

        # 4. Cancel pending actions whose drift has been resolved.
        # When a service becomes healthy again (because node-agent emits
        # service_healthy and the observer updates services.json), any
        # previously queued redeploy/container_restart action for that
        # service is no longer needed. Move it to "cancelled/" so the
        # operator can see it was auto-resolved rather than silently dropped.
        self._cancel_resolved_pending_actions()

        # 5. Route HA diagnostic events emitted by ha-diag-agent.
        # Processed directly from the events directory — not via the world-state
        # drift loop — to avoid conflicts with stability-agent's independent
        # container health tracking for the homeassistant service.
        self._process_ha_events()

    # ------------------------------------------------------------------
    # Recommendation generation
    # ------------------------------------------------------------------

    def _generate_recommendation(self, drift):
        node = drift["node"]
        service = drift["service"]
        trigger_type = drift.get("trigger_type")

        # Choose action type first so we can build the stable, deterministic ID.
        # Stable IDs mean reconcile is truly idempotent: the same drift always
        # produces the same filename, so we never create duplicates even across
        # restarts of the supervisor.
        if trigger_type in CONTAINER_RESTART_TRIGGERS:
            action_id = f"container-restart-{node}-{service}"
        else:
            action_id = f"redeploy-{node}-{service}"

        # Skip if an action for this ID is already live in any active state
        # (pending → approved → running). This prevents re-creation after
        # a human approves an action that hasn't executed yet.
        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if trigger_type in CONTAINER_RESTART_TRIGGERS:
            # Lightweight remediation: the container exists but is not running
            # (containers_not_running) or its MQTT dependency is unreachable
            # (mqtt_unreachable). A docker restart is sufficient and low-risk.
            container_name = self._get_container_name(service)
            action = {
                "action_id": action_id,
                "timestamp": time.time(),
                "type": "container_restart",
                "node": node,
                "service": service,
                "container_name": container_name,
                "risk_level": "low",
                "confidence": 0.95,
                "description": (
                    f"Restart container '{container_name}' on {node} "
                    f"(service: {service}, reason: {trigger_type})"
                ),
                "status": "pending",
                "payload": {
                    "reason": trigger_type,
                    "svc_key": drift["svc_key"],
                },
            }
        else:
            # Full redeploy: container is running but service is broken,
            # or the cause is unknown / not a simple restart candidate.
            action = {
                "action_id": action_id,
                "timestamp": time.time(),
                "type": "redeploy",
                "node": node,
                "service": service,
                "risk_level": "guarded",
                "confidence": 0.9,
                "description": f"Redeploy {service} on {node} due to {drift['type']}",
                "status": "pending",
                "payload": {
                    "reason": drift["type"],
                    "svc_key": drift["svc_key"],
                },
            }

        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            _atomic_write_json(action_path, action)
            logger.info(
                f"Generated recommendation: {action_id} "
                f"(type={action['type']}, risk={action['risk_level']})"
            )
        except Exception as e:
            logger.error(f"Failed to save recommendation {action_id}: {e}")

    def _generate_disk_cleanup_recommendation(self, node: str):
        """
        Generate a disk_cleanup action when node-agent reports critical disk
        pressure (>85 %) on a node that supports automated Docker cleanup.

        This is an OPERATOR-APPROVED action (risk=guarded): it runs
        `docker image prune -a -f` and `docker volume prune -f`, which are
        more aggressive than the safe auto-cleanup the node-agent runs itself.

        Nodes in NO_DISK_CLEANUP_NODES never reach this method (filtered in
        reconcile) because their disk fullness is caused by application data
        (Frigate, HA) that the operator must handle manually.
        """
        action_id = f"disk-cleanup-{node}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "disk_cleanup",
            "node": node,
            "service": "",
            "risk_level": "guarded",
            "confidence": 0.85,
            "description": (
                f"Aggressive disk cleanup on {node}: docker image prune -a "
                f"and docker volume prune (requires operator approval)"
            ),
            "status": "pending",
            "payload": {
                "reason": "disk_pressure",
                "commands": [
                    "docker image prune -a -f",
                    "docker volume prune -f",
                ],
            },
        }

        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            _atomic_write_json(action_path, action)
            logger.info(
                f"Generated disk cleanup recommendation: {action_id} "
                f"(node={node}, risk=guarded)"
            )
        except Exception as e:
            logger.error(f"Failed to save disk cleanup recommendation {action_id}: {e}")

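Review note: the duplicate guard in the method above (skip when the action file already sits in `pending`, `approved`, or `running`) recurs in several generators in this diff. A minimal standalone sketch of that check follows, assuming the one-directory-per-state layout seen here. `action_in_flight` and `IN_FLIGHT_STATES` are hypothetical names used for illustration, not part of the supervisor.

```python
from pathlib import Path
from typing import Optional

# States in which an action counts as already queued (taken from the loop above).
IN_FLIGHT_STATES = ("pending", "approved", "running")


def action_in_flight(actions_dir: Path, action_id: str) -> Optional[str]:
    """Return the state directory that already holds action_id, or None.

    Hypothetical helper: assumes one sub-directory per state and one
    <action_id>.json file per action, as in the diff.
    """
    for state in IN_FLIGHT_STATES:
        if (actions_dir / state / f"{action_id}.json").exists():
            return state
    return None
```

A caller would return early whenever the result is not `None`, which keeps a deterministic id such as `disk-cleanup-<node>` from being queued twice.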
    def _cancel_resolved_pending_actions(self):
        """
        Auto-cancel pending service actions (redeploy / container_restart) whose
        target service is now healthy in the actual state.

        This keeps the action queue clean: when node-agent starts reporting
        service_healthy for a container that previously had no world-state entry,
        the pending 'missing_service' redeploy action that was generated before
        the first health confirmation should be removed automatically rather than
        sitting in the queue until an operator manually rejects it.

        Only pending actions are considered — approved/running actions have already
        been committed to by the operator and must not be cancelled automatically.
        """
        cancelled_dir = ACTIONS_DIR / "cancelled"
        cancelled_dir.mkdir(parents=True, exist_ok=True)
        pending_dir = ACTIONS_DIR / "pending"
        if not pending_dir.exists():
            return

        for action_file in list(pending_dir.glob("*.json")):
            try:
                with open(action_file, "r") as f:
                    action = json.load(f)
            except Exception as e:
                logger.error(f"Failed to read action {action_file.name}: {e}")
                continue

            action_type = action.get("type")
            node = action.get("node")
            service = action.get("service")

            # Only auto-cancel service-level actions (not disk_cleanup)
            if action_type not in ("redeploy", "container_restart"):
                continue
            if not node or not service:
                continue

            svc_key = f"{node}/{service}"

            cancel_reason = None

            # Case 1: service is no longer in desired state (removed from services.yaml
            # or marked monitor:false). The action was generated under old config.
            if svc_key not in self.desired_state["services"]:
                cancel_reason = "service_removed_from_desired_state"

            # Case 2: drift resolved — service is now healthy in actual state.
            elif self.actual_state["services"].get(svc_key, {}).get("status") == "healthy":
                cancel_reason = "drift_resolved_auto"

            if cancel_reason:
                dest = cancelled_dir / action_file.name
                try:
                    action["status"] = "cancelled"
                    action["cancelled_reason"] = cancel_reason
                    action["cancelled_at"] = time.time()
                    _atomic_write_json(dest, action)
                    action_file.unlink()
                    logger.info(
                        f"Auto-cancelled {action_file.name}: "
                        f"{svc_key} — {cancel_reason}"
                    )
                except Exception as e:
                    logger.error(f"Failed to cancel action {action_file.name}: {e}")

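Review note: the two cancellation cases above can be read as a pure decision over the desired and actual service maps. The sketch below restates them without any file I/O, which may make the precedence (removal from desired state wins over health) easier to test. `cancel_reason_for` is a hypothetical name; the reason strings are the ones used in the method.

```python
from typing import Optional


def cancel_reason_for(
    svc_key: str, desired_services: dict, actual_services: dict
) -> Optional[str]:
    """Mirror the two auto-cancel cases as a pure function (hypothetical helper)."""
    # Case 1: the service is no longer part of the desired state.
    if svc_key not in desired_services:
        return "service_removed_from_desired_state"
    # Case 2: the drift resolved and the service reports healthy again.
    if actual_services.get(svc_key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    # Otherwise the pending action is still relevant and stays queued.
    return None
```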
    # ------------------------------------------------------------------
    # HA diagnostic event routing
    # ------------------------------------------------------------------

    def _process_ha_events(self):
        """Scan the events directory for unprocessed ha_* events and route them."""
        if not EVENTS_DIR.exists():
            return
        for event_file in sorted(EVENTS_DIR.glob("**/*.json")):
            event_id = event_file.stem
            if event_id in self._ha_processed_event_ids:
                continue
            self._ha_processed_event_ids.add(event_id)
            try:
                with open(event_file) as f:
                    event = json.load(f)
            except Exception as e:
                logger.debug(f"Could not read event {event_file}: {e}")
                continue
            if not event.get("type", "").startswith("ha_"):
                continue
            self._route_ha_event(event)

    def _route_ha_event(self, event: dict):
        event_type = event.get("type", "")
        node = event.get("node", "")
        if not node:
            return

        if event_type in HA_CONTAINER_RESTART_EVENTS:
            if self._is_ha_in_transition(node):
                logger.debug(
                    f"Suppressing {event_type} on {node}: homeassistant in transition"
                )
                return
            if HA_DIAG_SHADOW_MODE:
                logger.info(
                    "shadow_mode: suppressed container_restart for %s", event_type
                )
                self._generate_ha_shadow_alert(node, event)
            else:
                self._generate_ha_container_restart(node, event)

        elif event_type == "ha_websocket_recovered":
            self._cancel_ha_container_restart(node)

        elif event_type in HA_ALERT_ONLY_EVENTS:
            if self._is_ha_in_transition(node):
                logger.debug(
                    f"Suppressing {event_type} on {node}: homeassistant in transition"
                )
                return
            self._generate_ha_alert_only(node, event)

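Review note: the branching in `_route_ha_event` can be summarised as a small decision function. The sketch below is illustrative only. The real `HA_CONTAINER_RESTART_EVENTS` and `HA_ALERT_ONLY_EVENTS` sets are defined outside this part of the diff, so the members used here (`ha_websocket_dead`, `ha_integration_failed`) are assumptions, and `route` is a hypothetical name.

```python
# Assumed event sets: placeholders for HA_CONTAINER_RESTART_EVENTS and
# HA_ALERT_ONLY_EVENTS, whose real contents are not shown in this diff.
RESTART_EVENTS = {"ha_websocket_dead"}
ALERT_ONLY_EVENTS = {"ha_integration_failed"}


def route(event_type: str, in_transition: bool, shadow_mode: bool) -> str:
    """Return a label for the branch the router would take."""
    if event_type in RESTART_EVENTS:
        if in_transition:
            return "suppressed"
        # Shadow mode downgrades the restart to an alert-only action.
        return "shadow_alert" if shadow_mode else "container_restart"
    if event_type == "ha_websocket_recovered":
        # Recovery is never suppressed by the transition check.
        return "cancel_restart"
    if event_type in ALERT_ONLY_EVENTS:
        return "suppressed" if in_transition else "alert_only"
    return "ignored"
```

One property worth noting: the transition check gates both restart and alert-only events, while the recovery event always goes through to cancellation.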
    def _is_ha_in_transition(self, node: str) -> bool:
        """Return True if homeassistant container had a recent containers_not_running incident.

        Suppresses ha_* alerts during planned HA restarts/updates to avoid
        flooding the operator with secondary diagnostic alerts.
        """
        svc_key = f"{node}/homeassistant"
        svc_info = self.actual_state["services"].get(svc_key, {})
        incident_id = svc_info.get("incident_id")
        if not incident_id:
            return False
        incident = self.actual_state["incidents"].get(incident_id, {})
        return (
            incident.get("status") == "active"
            and incident.get("trigger_type") == "containers_not_running"
            and time.time() - (incident.get("last_occurrence") or 0) < HA_TRANSITION_WINDOW
        )

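Review note: the transition predicate above depends on wall-clock time, which makes it awkward to unit-test. A sketch with an injectable clock is shown below. The 300-second window is an assumed value (the real `HA_TRANSITION_WINDOW` is defined elsewhere), and `in_transition` is a hypothetical name.

```python
import time
from typing import Optional

# Assumed value in seconds; the real HA_TRANSITION_WINDOW is not shown in this diff.
TRANSITION_WINDOW = 300


def in_transition(
    incident: dict, now: Optional[float] = None, window: float = TRANSITION_WINDOW
) -> bool:
    """True if the incident is an active containers_not_running seen within the window."""
    now = time.time() if now is None else now
    return (
        incident.get("status") == "active"
        and incident.get("trigger_type") == "containers_not_running"
        # A missing or None last_occurrence counts as epoch 0, i.e. far in the past.
        and now - (incident.get("last_occurrence") or 0) < window
    )
```

Like the original, this assumes `last_occurrence` is numeric; an ISO-string value would need normalising first.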
    def _ha_action_recently_completed(self, action_id: str, cooldown: int) -> bool:
        """Return True if action completed/rejected/cancelled within the cooldown window."""
        for state in ("completed", "rejected", "cancelled"):
            path = ACTIONS_DIR / state / f"{action_id}.json"
            if path.exists():
                try:
                    with open(path) as f:
                        data = json.load(f)
                    finished = (
                        data.get("finished_at")
                        or data.get("cancelled_at")
                        or data.get("updated_at")
                        or 0
                    )
                    if time.time() - finished < cooldown:
                        return True
                except Exception:
                    pass
        return False

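Review note: the cooldown test above picks the first truthy timestamp among `finished_at`, `cancelled_at`, and `updated_at`, and falls back to 0 so that a record without timestamps never blocks a new action. The sketch below isolates that rule on an already-loaded record; `within_cooldown` is a hypothetical name.

```python
def within_cooldown(record: dict, cooldown: float, now: float) -> bool:
    """True if the record finished less than `cooldown` seconds before `now`."""
    finished = (
        record.get("finished_at")
        or record.get("cancelled_at")
        or record.get("updated_at")
        or 0  # no timestamp at all: treat as finished at the epoch
    )
    return now - finished < cooldown
```

Because the chain uses `or`, a stored value of `0` or `None` in an earlier field falls through to the next field rather than being taken literally.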
    def _generate_ha_container_restart(self, node: str, event: dict):
        service = "homeassistant"
        action_id = f"container-restart-{node}-{service}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = "ha_websocket_dead"
        payload["svc_key"] = f"{node}/{service}"

        container_name = self._get_container_name(service)
        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "container_restart",
            "node": node,
            "service": service,
            "container_name": container_name,
            "risk_level": "low",
            "confidence": 0.9,
            "description": (
                f"Restart '{container_name}' on {node}: HA WebSocket unresponsive"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _generate_ha_shadow_alert(self, node: str, event: dict):
        """Shadow-mode downgrade: emit alert_only instead of container_restart.

        Uses the same action_id and cooldown as the real restart so that
        cooldown semantics are identical regardless of shadow mode state.
        """
        service = "homeassistant"
        action_id = f"container-restart-{node}-{service}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = "ha_websocket_dead"
        payload["svc_key"] = f"{node}/{service}"
        payload["shadow_mode"] = True

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "alert_only",
            "node": node,
            "service": service,
            "risk_level": "info",
            "confidence": 0.9,
            "description": (
                f"[SHADOW MODE] would have triggered container_restart "
                f"for {service} on {node}: HA WebSocket unresponsive"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _generate_ha_alert_only(self, node: str, event: dict):
        event_type = event.get("type", "")
        suffix = _HA_ALERT_ID_SUFFIX.get(event_type, event_type.replace("_", "-"))
        action_id = f"alert-ha-{suffix}-{node}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_ALERT_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_ALERT_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = event_type

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "alert_only",
            "node": node,
            "service": event.get("service", "homeassistant"),
            "risk_level": "info",
            "confidence": 1.0,
            "description": event.get(
                "message", f"HA diagnostic alert: {event_type} on {node}"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _cancel_ha_container_restart(self, node: str):
        """Move a pending ha_websocket_dead container_restart to cancelled on recovery."""
        action_id = f"container-restart-{node}-homeassistant"
        pending_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        if not pending_path.exists():
            return
        cancelled_dir = ACTIONS_DIR / "cancelled"
        cancelled_dir.mkdir(parents=True, exist_ok=True)
        dest = cancelled_dir / f"{action_id}.json"
        try:
            with open(pending_path) as f:
                action = json.load(f)
            action["status"] = "cancelled"
            action["cancelled_reason"] = "ha_websocket_recovered"
            action["cancelled_at"] = time.time()
            _atomic_write_json(dest, action)
            pending_path.unlink()
            logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
        except Exception as e:
            logger.error(f"Failed to cancel {action_id}: {e}")

    def _write_pending_action(self, action: dict):
        action_id = action["action_id"]
        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            _atomic_write_json(action_path, action)
            logger.info(
                f"Generated HA action: {action_id} "
                f"(type={action['type']}, risk={action['risk_level']})"
            )
        except Exception as e:
            logger.error(f"Failed to save action {action_id}: {e}")

    def loop(self, interval=30):
        logger.info("Starting supervisor loop")
        while True:
            self.reconcile()
            time.sleep(interval)


if __name__ == "__main__":
    supervisor = Supervisor()
    supervisor.loop()
0
services/control-plane/tests/__init__.py
Normal file
333
services/control-plane/tests/test_incident_lifecycle.py
Normal file
@@ -0,0 +1,333 @@
"""Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
from __future__ import annotations

import json
import sys
import time
from pathlib import Path

import pytest

# Observer lives outside the control-plane package; add scripts/ to path.
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
from observer.observer import Observer, _parse_ts, _atomic_write_json


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _make_observer(tmp_path: Path) -> Observer:
    """Return an Observer with all runtime paths redirected to tmp_path."""
    import observer.observer as obs_mod

    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"

    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)

    # Minimal topology so inventory isn't empty (avoids prune-guard early-return)
    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
    )

    original_world = obs_mod.WORLD_DIR
    original_state = obs_mod.STATE_DIR
    original_events = obs_mod.EVENTS_DIR
    original_logs = obs_mod.LOGS_DIR
    original_inventory = obs_mod.INVENTORY_TOPOLOGY
    original_repo = obs_mod.REPO_ROOT

    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo

    obs = Observer()

    # Restore module-level constants (monkeypatching at module level is sufficient
    # for the Observer instance which captures paths at construction time via globals)
    obs_mod.WORLD_DIR = original_world
    obs_mod.STATE_DIR = original_state
    obs_mod.EVENTS_DIR = original_events
    obs_mod.LOGS_DIR = original_logs
    obs_mod.INVENTORY_TOPOLOGY = original_inventory
    obs_mod.REPO_ROOT = original_repo

    return obs


def _make_observer_simple(tmp_path: Path):
    """Return an Observer instance and patch its world_state in-place."""
    import observer.observer as obs_mod

    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"

    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)

    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
    )

    # Patch before construction
    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo

    obs = Observer()
    return obs


# ---------------------------------------------------------------------------
# 1. _parse_ts — timestamp normalisation
# ---------------------------------------------------------------------------

def test_parse_ts_int():
    ts = int(time.time()) - 3600
    assert abs(_parse_ts(ts) - ts) < 1


def test_parse_ts_float():
    ts = time.time() - 100.5
    assert abs(_parse_ts(ts) - ts) < 0.01


def test_parse_ts_iso_string():
    # ISO format as emitted by events.py / stability-agent
    from datetime import datetime, timezone
    iso = "2026-06-01T00:03:22Z"
    expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
    result = _parse_ts(iso)
    assert result > 0
    assert isinstance(result, float)
    assert abs(result - expected) < 1


def test_parse_ts_none_returns_zero():
    assert _parse_ts(None) == 0.0


def test_parse_ts_garbage_returns_zero():
    assert _parse_ts("not-a-date") == 0.0


def test_parse_ts_zero_int():
    assert _parse_ts(0) == 0.0


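Review note: the observer's `_parse_ts` itself is not part of this diff section. The sketch below is one possible implementation that would satisfy the six tests above; it is an assumption for illustration, not the project's code, and `parse_ts` is a hypothetical name. Treating naive ISO strings as UTC is also an assumption.

```python
from datetime import datetime, timezone


def parse_ts(value) -> float:
    """Normalise an epoch number or ISO-8601 string to epoch seconds; 0.0 on failure."""
    if value is None:
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            # Older Pythons reject a trailing "Z" in fromisoformat(), so map it to +00:00.
            dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            # Assumption: timestamps without an offset are UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    # Any other type (list, dict, ...) is treated as unparseable.
    return 0.0
```

Returning `0.0` for unparseable input means such timestamps read as "very old", which is what lets the prune logic compare them numerically without raising `TypeError`.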
# ---------------------------------------------------------------------------
# 2. Lifecycle: service_healthy event resolves linked incident
# ---------------------------------------------------------------------------

def test_service_healthy_resolves_active_incident(tmp_path):
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-111-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "node": "vps", "service": "outline",
        "status": "active", "trigger_type": "service_unhealthy",
        "started_at": int(time.time()) - 600,
        "last_occurrence": int(time.time()) - 600,
        "occurrence_count": 1, "events": [],
    }

    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })

    assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
    """service_healthy for service A must not touch incident for service B."""
    obs = _make_observer_simple(tmp_path)
    inc_b = "inc-222-vps-supervisor"
    obs.world_state["services"]["vps/supervisor"] = {
        "node": "vps", "service": "supervisor",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_b,
    }
    obs.world_state["incidents"][inc_b] = {
        "id": inc_b, "status": "active",
        "last_occurrence": int(time.time()) - 300,
    }

    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",  # different service
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })

    assert obs.world_state["incidents"][inc_b]["status"] == "active"


# ---------------------------------------------------------------------------
# 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
# ---------------------------------------------------------------------------

def test_prune_resolves_healthy_linked_incident(tmp_path):
    """If a service is healthy but still points at an active incident, resolve it."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-333-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy",  # <-- healthy but incident_id still set
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "started_at": int(time.time()) - 7200,
        "last_occurrence": int(time.time()) - 7200,
    }

    obs._prune_stale_world()

    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
    """Healthy-linked incident with ISO-string last_occurrence must still resolve."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-444-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy", "last_check": None, "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",  # ISO string from events.py
    }

    obs._prune_stale_world()  # must not raise TypeError

    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


# ---------------------------------------------------------------------------
# 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
# ---------------------------------------------------------------------------

def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
    """Orphaned active incident older than 5 min must be auto-resolved."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-555-vps-supervisor"
    # No service entry links to this incident
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
        "last_occurrence": int(time.time()) - 400,  # 6.7 min ago
    }

    obs._prune_stale_world()

    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
    """Orphaned incident younger than 5 min must stay active (guard against race)."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-666-vps-supervisor"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 60,  # 1 min ago — within guard
    }

    obs._prune_stale_world()

    assert obs.world_state["incidents"][inc_id]["status"] == "active"


def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
    """Orphaned incident with ISO-string last_occurrence must resolve correctly."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-777-vps-outline"
    # ISO timestamp well in the past (2026-06-01)
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",
    }

    obs._prune_stale_world()  # must not raise TypeError

    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_does_not_touch_linked_incident(tmp_path):
    """An active incident still linked from a non-healthy service must stay active."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-888-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy",  # <-- still unhealthy
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 3600,
    }

    obs._prune_stale_world()

    assert obs.world_state["incidents"][inc_id]["status"] == "active"


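Review note: sections 3 and 4 together pin down a three-way rule for active incidents. The sketch below restates that rule as a pure function so the cases can be read side by side. It is inferred from the tests only; the real `_prune_stale_world` is not shown here, `should_auto_resolve` is a hypothetical name, and `last_occurrence` is assumed to be already normalised to epoch seconds.

```python
from typing import Optional

# The 5-minute guard named in the section 4 tests.
ORPHAN_GRACE_SECONDS = 300


def should_auto_resolve(incident: dict, linked_status: Optional[str], now: float) -> bool:
    """Decide whether an incident should be auto-resolved.

    linked_status is the status of the service whose incident_id points at this
    incident, or None when no service links to it (an orphan).
    """
    if incident.get("status") != "active":
        return False
    if linked_status == "healthy":
        # Healthy service still linked to an active incident: resolve immediately.
        return True
    if linked_status is None:
        # Orphan: resolve only once it is older than the grace period.
        age = now - float(incident.get("last_occurrence") or 0)
        return age > ORPHAN_GRACE_SECONDS
    # Linked from a non-healthy service: leave it active.
    return False
```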
# ---------------------------------------------------------------------------
# 5. 7-day stale incident prune with ISO resolved_at
# ---------------------------------------------------------------------------

def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
    """Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-old-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": "2026-05-01T00:00:00Z",  # >7 days before 2026-06-03
    }

    obs._prune_stale_world()

    assert inc_id not in obs.world_state["incidents"]


def test_prune_keeps_recently_resolved_incident(tmp_path):
    """Resolved incidents within 7 days must be kept."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-recent-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": time.time() - 86400,  # 1 day ago
    }

    obs._prune_stale_world()

    assert inc_id in obs.world_state["incidents"]
199
services/control-plane/tests/test_state_reliability.py
Normal file
@@ -0,0 +1,199 @@
"""Tests for atomic writes and resilient world-state loading in the supervisor."""
from __future__ import annotations

import json
import sys
import time
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
import supervisor as supervisor_module
from supervisor import Supervisor, _atomic_write_json


# ---------------------------------------------------------------------------
# Helpers (reused from test_supervisor_ha)
# ---------------------------------------------------------------------------

def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
    actions = tmp_path / "actions"
    events = tmp_path / "events"
    world = tmp_path / "world"
    repo = tmp_path / "repo"

    for d in (actions, events, world, repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)

    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)

    sup = Supervisor()
    sup.desired_state = {"services": {}}
    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
    return sup


# ---------------------------------------------------------------------------
# 1. atomic_write_json correctness
# ---------------------------------------------------------------------------

def test_atomic_write_json_produces_valid_json(tmp_path):
    path = tmp_path / "out.json"
    data = {"services": {"vps/outline": {"status": "healthy"}}, "count": 42}
    _atomic_write_json(path, data)

    assert path.exists(), "output file must exist after atomic write"
    loaded = json.loads(path.read_text())
    assert loaded == data


def test_atomic_write_json_no_tmp_left_behind(tmp_path):
    path = tmp_path / "world.json"
    _atomic_write_json(path, {"ok": True})

    tmp = path.with_suffix(".tmp")
    assert not tmp.exists(), ".tmp must be cleaned up by os.replace"


def test_atomic_write_json_overwrites_existing(tmp_path):
    path = tmp_path / "state.json"
    path.write_text('{"old": true}')
    _atomic_write_json(path, {"new": True})
    assert json.loads(path.read_text()) == {"new": True}


def test_atomic_write_json_nested_structure(tmp_path):
    path = tmp_path / "complex.json"
    data = {
        "nodes": {"vps": {"status": "online", "disk_usage_pct": 42}},
        "incidents": {},
        "list": [1, 2, 3],
    }
    _atomic_write_json(path, data)
    assert json.loads(path.read_text()) == data


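Review note: `_atomic_write_json` is imported by these tests but defined elsewhere. The usual pattern the tests point at (a sibling `.tmp` file swapped in with `os.replace`) can be sketched as below. This is an assumed implementation for illustration; the project's function may differ in details such as fsync or indentation, and `atomic_write_json` is a hypothetical name.

```python
import json
import os
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write JSON so that readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sibling temp file on the same filesystem, so os.replace is a rename.
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    # The rename consumes the temp file, which is why no .tmp is left behind.
    os.replace(tmp, path)
```

The design point is that the rename is the only step a concurrent reader can observe, which is what removes the empty-file window exercised by the loader tests in the next section.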
# ---------------------------------------------------------------------------
# 2. Resilient loader: empty / truncated file → skip cycle, no drift
# ---------------------------------------------------------------------------

def _populate_desired(sup: Supervisor, svc_key: str = "vps/outline"):
    node, service = svc_key.split("/", 1)
    sup.desired_state["services"][svc_key] = {
        "node": node,
        "service": service,
        "desired": "running",
    }


def test_empty_services_json_skips_reconcile(tmp_path, monkeypatch):
    """Empty services.json (truncated write) must not generate any redeploy action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    # Write empty services.json — simulates a mid-write truncation
    (tmp_path / "world" / "services.json").write_text("")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions should be generated on empty state file, got: {[p.name for p in pending]}"


def test_truncated_services_json_skips_reconcile(tmp_path, monkeypatch):
    """Partially-written (truncated mid-write) JSON must not generate any action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    (tmp_path / "world" / "services.json").write_text('{"vps/outline": {"status": "hea')
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions expected on truncated state, got: {[p.name for p in pending]}"


def test_empty_incidents_json_skips_reconcile(tmp_path, monkeypatch):
    """Empty incidents.json (any world-state file failing) skips full cycle."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    (tmp_path / "world" / "services.json").write_text("{}")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions expected when any state file is unreadable, got: {[p.name for p in pending]}"


def test_load_actual_state_returns_false_on_empty_file(tmp_path, monkeypatch):
|
||||
"""_load_actual_state must return False (not raise) when a file is empty."""
|
||||
sup = _setup_supervisor(tmp_path, monkeypatch)
|
||||
|
||||
(tmp_path / "world" / "services.json").write_text("")
|
||||
(tmp_path / "world" / "nodes.json").write_text("{}")
|
||||
(tmp_path / "world" / "incidents.json").write_text("{}")
|
||||
|
||||
result = sup._load_actual_state()
|
||||
assert result is False
|
||||
|
||||
|
||||
def test_load_actual_state_returns_true_on_valid_files(tmp_path, monkeypatch):
    """_load_actual_state returns True and populates actual_state on valid files."""
    sup = _setup_supervisor(tmp_path, monkeypatch)

    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text('{"vps": {"status": "online"}}')
    (tmp_path / "world" / "incidents.json").write_text("{}")

    result = sup._load_actual_state()
    assert result is True
    assert "vps/outline" in sup.actual_state["services"]


def test_parse_failure_preserves_last_known_good_state(tmp_path, monkeypatch):
    """When a file becomes unreadable, actual_state retains the previous good values."""
    sup = _setup_supervisor(tmp_path, monkeypatch)

    # First successful load
    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    assert sup._load_actual_state() is True
    assert "vps/outline" in sup.actual_state["services"]

    # File becomes empty (race condition)
    (tmp_path / "world" / "services.json").write_text("")
    assert sup._load_actual_state() is False

    # State must be unchanged from the previous good load
    assert "vps/outline" in sup.actual_state["services"], \
        "Last-known-good state must be preserved on parse failure"


def test_healthy_service_does_not_generate_action(tmp_path, monkeypatch):
    """A desired service that appears healthy in world state generates no action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], "Healthy service must not generate any action"
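
The tests above pin down a contract: an empty or truncated world-state file makes the load report failure without raising, and the last-known-good state survives. The supervisor's real `_load_actual_state` is not shown in this diff, so the following is only a minimal sketch of one way to satisfy that contract. The function `load_world_state` and the constant `STATE_FILES` are hypothetical names chosen for illustration.

```python
import json
from pathlib import Path

# Hypothetical constant: mirrors the three files the tests write under world/.
STATE_FILES = ("services", "nodes", "incidents")


def load_world_state(world_dir, previous):
    """Return (ok, state); on any read or parse failure keep `previous` as-is.

    Illustrative stand-in only, not the supervisor's actual method.
    """
    fresh = {}
    for name in STATE_FILES:
        path = Path(world_dir) / f"{name}.json"
        try:
            fresh[name] = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            # Empty, truncated, or missing file: signal failure and hand
            # back the previous snapshot untouched.
            return False, previous
    # Swap in the new snapshot only after every file parsed cleanly.
    return True, fresh
```

Building the snapshot in a local dict and returning it only when every file parses means a half-written file can never leave the caller with a partially updated state, which is the property `test_parse_failure_preserves_last_known_good_state` checks.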
Some files were not shown because too many files have changed in this diff.