Compare commits
No commits in common. "master" and "runtime-event-system" have entirely different histories.
master
...
runtime-ev
@@ -1,43 +0,0 @@
---
name: deploy
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
---

Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.

## Targets

| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |

## Invocation

```bash
scripts/deploy/deploy.sh <target>             # full pipeline
scripts/deploy/deploy.sh <target> --dry-run   # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate   # emergency: bypass tests
```

## Exit Code Handling

| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
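
As a rough sketch of how a caller might act on these codes: the helper names `describe_deploy_exit` and `run_deploy` are hypothetical, and the messages only paraphrase the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a deploy.sh exit code to the required action.
describe_deploy_exit() {
  case "$1" in
    0) echo "success: report target, commit hash, gate status, verify status, elapsed time" ;;
    1) echo "preflight failed: fix the upstream issue, never bypass" ;;
    2) echo "gate failed: show the failing test or build, do not deploy" ;;
    3) echo "execute failed: show full output, ask whether to investigate or roll back" ;;
    4) echo "verify failed: show docker ps output, discuss rollback" ;;
    5) echo "sudo handoff: print the manual command from stderr verbatim and stop" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Hypothetical wrapper: run the sole entry point, then print the required action.
run_deploy() {
  rc=0
  scripts/deploy/deploy.sh "$@" || rc=$?
  describe_deploy_exit "$rc"
  return "$rc"
}
```

For example, `run_deploy vps --dry-run` would run preflight and gate only, then print the action matching the exit code.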

## Rules

- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
- Canonical branch is `master` — preflight enforces this.
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
@@ -1,65 +0,0 @@
---
name: save-session
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
---

**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
Never invoke proactively. Never invoke mid-task.

## 1. Determine Session Boundary

1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
2. Fallback if no previous entry exists: 24 hours ago.
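
The two steps above could be sketched as follows. This is a minimal sketch under assumptions: session files are named `YYYY-MM-DD.md`, so a lexical sort puts the newest last, and the helper name `session_boundary` is hypothetical. It prints nothing when no entry exists, in which case the 24-hour fallback applies.

```shell
# Hypothetical helper: print "YYYY-MM-DD HH:MM" for the last session heading
# in the newest file under the given directory (default: docs/sessions).
# Prints nothing if no session file or heading exists.
session_boundary() {
  dir="${1:-docs/sessions}"
  # Newest file by name; relies on the YYYY-MM-DD.md naming assumption.
  latest=$(ls "$dir"/*.md 2>/dev/null | sort | tail -n 1)
  [ -n "$latest" ] || return 0
  # Last "## Session HH:MM" heading in that file.
  hhmm=$(grep -E '^## Session [0-9]{2}:[0-9]{2}' "$latest" | tail -n 1 \
    | sed -E 's/^## Session ([0-9]{2}:[0-9]{2}).*/\1/')
  [ -n "$hhmm" ] || return 0
  echo "$(basename "$latest" .md) $hhmm"
}
```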

## 2. Collect Facts (deterministic only — no invention)

Run exactly:
```bash
# <boundary>..HEAD is a revision range, so <boundary> must resolve to a commit.
# One way to turn a timestamp into a commit: git rev-list -1 --before="<timestamp>" HEAD

# All commits since boundary
git --no-pager log --oneline <boundary>..HEAD

# Changed file summary
git --no-pager diff --stat <boundary>..HEAD
```

From the visible conversation transcript, collect the deploys that were run, their outcomes, and any test results seen.

## 3. Write the Session Entry

**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
Never overwrite existing content.

```markdown
## Session HH:MM

### Commits
<output of git log --oneline>

### Files changed
<output of git diff --stat>

### Deploys
<list from transcript, or "None recorded">

### Narrative
> _user-provided summary_
```

The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
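
A minimal sketch of the append step, assuming the entry is assembled by a shell helper. `append_session_entry` and its argument order are hypothetical; the commit list and diff stat are passed in as text, and the narrative placeholder is written literally.

```shell
# Hypothetical helper: append one session entry to a session file.
# Usage: append_session_entry <file> <HH:MM> <commits-text> <diffstat-text> [deploys-text]
append_session_entry() {
  file="$1"; hhmm="$2"; commits="$3"; stat="$4"; deploys="${5:-None recorded}"
  mkdir -p "$(dirname "$file")"
  # Separate the new entry from existing content with a blank line.
  if [ -s "$file" ]; then
    printf '\n' >> "$file"
  fi
  # ">>" appends, so existing content is never overwritten.
  printf '## Session %s\n\n### Commits\n%s\n\n### Files changed\n%s\n\n### Deploys\n%s\n\n### Narrative\n> _user-provided summary_\n' \
    "$hhmm" "$commits" "$stat" "$deploys" >> "$file"
}
```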

## 4. What NOT to Touch

- `backlog.md` — only on explicit "update backlog" instruction
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
- Any other file not listed above

## 5. Commit

Stage and commit **only** the session file:

```bash
git add docs/sessions/YYYY-MM-DD.md
git commit -m "docs: session YYYY-MM-DD HH:MM"
```

No other files. No `git add -A`.
@@ -1,81 +0,0 @@
---
name: worktree-aware
description: >
  Use when working in a git worktree checkout for a parallel agent task.
  The presence of an .agent-task file in the current working directory indicates
  a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
  to the assigned task branch, NEVER push origin master, NEVER touch the main
  checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
  completion, report the branch name verbatim and stop — the human merges via
  scripts/dev/agent.sh.
---

## When this applies

- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
  In the main checkout these rules do not apply.

## Reading the marker

`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:

```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```

Always read this file first before taking any action.
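
Reading the marker could look roughly like this. It is a sketch that assumes the flat, unquoted `key: value` layout shown in the example; the helper name `assigned_branch` is hypothetical, and a real YAML parser would be more robust.

```shell
# Hypothetical helper: print the value of the `branch:` key from an .agent-task
# file (default: ./.agent-task). Returns non-zero if the file or key is missing.
assigned_branch() {
  marker="${1:-.agent-task}"
  [ -f "$marker" ] || return 1
  # Take the first top-level "branch:" line and strip the key.
  branch=$(sed -n 's/^branch:[[:space:]]*//p' "$marker" | head -n 1)
  [ -n "$branch" ] || return 1
  echo "$branch"
}
```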

## Rules

1. **Commit only to your branch.**
   Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
   If it does not, stop immediately and report the discrepancy.

2. **Push only to your branch.**
   The only permitted push is `git push origin task/<name>`.
   NEVER `git push origin master` or any other branch.

3. **Do not touch the main checkout.**
   `~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
   Do not read from, write to, or execute commands inside it.

4. **Stay scoped.**
   Only change files directly related to your assigned task.
   If you notice other problems, report them in your final summary as separate follow-up proposals.
   Do not fix them in this worktree.

5. **Never `git add -A`.**
   Always stage specific files by name: `git add path/to/file`.

6. **Do not manage worktrees.**
   Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
   Worktree lifecycle is the human's responsibility.

7. **Final report before stopping.**
   When the task is done, provide a structured report containing:
   - Files changed (path and one-line summary of change)
   - Tests run and results
   - All commit hashes on the task branch
   - **Branch name verbatim** (copy-paste ready)
   - Follow-up items as bulleted proposals for separate tasks

## Definition of Done

- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
- Test suite passes
- Branch pushed: `git push origin task/<name>`
- Full report delivered in conversation
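
The branch check from rule 1 could be sketched as a guard run before each commit or push. `require_branch` is a hypothetical helper; it reads the branch with `git symbolic-ref --short HEAD` rather than parsing `git status` output, and the optional second argument exists only so the check can be exercised outside a repository.

```shell
# Hypothetical guard: succeed only if the current branch equals the expected one.
# Usage: require_branch <expected-branch> [current-branch]
require_branch() {
  expected="$1"
  # Without a second argument, ask git; an empty result means detached HEAD or no repo.
  current="${2:-$(git symbolic-ref --short HEAD 2>/dev/null || true)}"
  if [ "$current" != "$expected" ]; then
    echo "refusing to continue: on '${current:-unknown}', expected '$expected'" >&2
    return 1
  fi
}
```

A caller would typically stop on failure, e.g. `require_branch task/my-feature || exit 1` before `git commit`.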

## What you do NOT do

- Merge branches
- Create or push tags
- Run deploys or healthchecks against production nodes
- Delete branches or worktrees
- Modify files in other worktrees
- Push to `origin master` under any circumstances
.gitignore: 1 line changed (vendored)
@@ -15,7 +15,6 @@ __pycache__/
 *$py.class
 venv/
 .venv/
-*.egg-info/
 
 # Tools
 .aider*
CLAUDE.md: 194 lines changed
@@ -1,194 +0,0 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Repo Is

GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.

## Node Roles

| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |

All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.

## Deployment

```bash
scripts/deploy/deploy.sh                          # fresh deploy on current node
scripts/deploy/deploy.sh --resume                 # resume after interruption
scripts/deploy/deploy.sh --stage verify           # specific stage only
scripts/deploy/deploy.sh --service mosquitto      # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh    # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra     # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh               # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh            # CHELSTY-specific bootstrap
```

Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.

## Service Structure

Every service must follow this layout:

```
services/<service>/
├── docker-compose.yml
├── service.yaml          # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example           # Template — never commit actual secrets
└── healthcheck.sh        # Returns 0 (healthy) or 1 (unhealthy)
```

`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
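
The required layout can be checked mechanically. The following is a sketch only; `check_service_layout` is a hypothetical helper, not a script that exists in the repo.

```shell
# Hypothetical check: verify a service directory contains the five required files.
# Prints each missing file and returns non-zero if any is absent.
check_service_layout() {
  svc_dir="$1"
  missing=0
  for f in docker-compose.yml service.yaml README.md env.example healthcheck.sh; do
    if [ ! -f "$svc_dir/$f" ]; then
      echo "missing: $svc_dir/$f"
      missing=1
    fi
  done
  return "$missing"
}
```

It could be run per service, for example `check_service_layout services/mosquitto`.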

Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.

## Agent System Architecture

The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:

1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.

### Action approval flow
```
Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed
```

Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
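
The file-based flow above might be sketched like this. All three helper names and the JSON fields are hypothetical, and treating approval as a move from `pending/` to `approved/` is an assumption drawn from the diagram, not from the Executor's code.

```shell
# Hypothetical helpers illustrating the approval flow.
# ACTIONS_DIR defaults to the runtime path named in the text.
ACTIONS_DIR="${ACTIONS_DIR:-/opt/homelab/actions}"

# Agent side: propose an action by writing <id>.json into pending/.
propose_action() {
  id="$1"; kind="$2"; target="$3"
  mkdir -p "$ACTIONS_DIR/pending"
  printf '{"id":"%s","type":"%s","target":"%s"}\n' "$id" "$kind" "$target" \
    > "$ACTIONS_DIR/pending/$id.json"
}

# Operator side: approval moves the file from pending/ to approved/ (assumed).
approve_action() {
  mkdir -p "$ACTIONS_DIR/approved"
  mv "$ACTIONS_DIR/pending/$1.json" "$ACTIONS_DIR/approved/$1.json"
}

# Executor side: an action may run only if its approved file exists.
is_approved() {
  [ -f "$ACTIONS_DIR/approved/$1.json" ]
}
```

The point of the guard is that `is_approved` fails for anything still in `pending/`, so no destructive step runs before an operator has acted.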

## Event System

Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.

Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).

Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
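
For illustration, appending one event line could look like the sketch below. The interface of the real `scripts/lib/events.sh` is not shown here, so `emit_event`, the JSON field names, and the use of UTC for the date directory are all assumptions.

```shell
# Hypothetical emitter; the real scripts/lib/events.sh interface may differ.
# EVENTS_DIR defaults to the runtime path named in the text.
EVENTS_DIR="${EVENTS_DIR:-/opt/homelab/events}"

# Usage: emit_event <node> <event-type> <message>
# The message must not contain double quotes or backslashes (no JSON escaping here).
emit_event() {
  node="$1"; etype="$2"; msg="$3"
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  day="${ts%%T*}"                      # YYYY-MM-DD part of the timestamp
  mkdir -p "$EVENTS_DIR/$day/$node"
  # ">>" keeps the store append-only: one JSON object per line.
  printf '{"ts":"%s","node":"%s","type":"%s","message":"%s"}\n' \
    "$ts" "$node" "$etype" "$msg" >> "$EVENTS_DIR/$day/$node/events.jsonl"
}
```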

### Supervisor event routing table

| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |

HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
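
The routing table and the suppression window can be read as a lookup plus a time check. The sketch below is not the supervisor's code: `route_event` and `ha_event_suppressed` are hypothetical names, `cancel_restart` is a made-up token for "cancels matching restart", and the severity condition on `disk_pressure` is ignored.

```shell
# Hypothetical lookup: print the action generated for an event type.
route_event() {
  case "$1" in
    containers_not_running|mqtt_unreachable|ha_websocket_dead) echo "container_restart" ;;
    disk_pressure)          echo "disk_cleanup" ;;
    ha_websocket_recovered) echo "cancel_restart" ;;
    ha_integration_failed|ha_entity_unavailable_long|ha_automation_failing) echo "alert_only" ;;
    ha_update_available|ha_recorder_lag|ha_system_health_degraded) echo "alert_only" ;;
    *)                      echo "redeploy" ;;   # service_unhealthy / other
  esac
}

# HA events are suppressed while the last containers_not_running incident for
# homeassistant is less than 5 minutes (300 s) old. Arguments are epoch seconds.
ha_event_suppressed() {
  last_incident_epoch="$1"; now_epoch="$2"
  [ $((now_epoch - last_incident_epoch)) -lt 300 ]
}
```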

## Discovery Entry Points for Agents

When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service

## VPS-Specific Rules

VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.

### Memory limit convention

Use service-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):

```yaml
services:
  myservice:
    mem_limit: 256m       # cgroup ceiling; the container is OOM-killed on breach and restarted per its restart policy
    oom_score_adj: -900   # makes the host kernel OOM killer very unlikely to pick this container
```

Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and will restart the container via Docker when the limit is exceeded.
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
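
The budget rule lends itself to a quick arithmetic check. This is a sketch under assumptions: both helper names are hypothetical, and only unquoted values of the form `<n>m` or `<n>g` are handled.

```shell
# Hypothetical helper: sum mem_limit values (in MiB) across override files.
# Arguments: one or more docker-compose.override.yml paths.
sum_mem_limits_mib() {
  grep -hE '^[[:space:]]*mem_limit:' "$@" | awk '
    {
      v = $2                              # e.g. "256m" or "1g"
      n = v; sub(/[mMgG].*$/, "", n)      # numeric part
      if (v ~ /[gG]/) n *= 1024           # GiB to MiB
      total += n
    }
    END { print total + 0 }'
}

# Hypothetical check: succeed only if the sum stays within the 3200 MiB budget.
check_vps_budget() {
  total=$(sum_mem_limits_mib "$@")
  echo "total mem_limit: ${total} MiB (budget 3200 MiB)"
  [ "$total" -le 3200 ]
}
```

A plausible invocation is `check_vps_budget hosts/vps/runtime/*/docker-compose.override.yml`.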

### Repo-managed services on VPS

All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.

| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |

**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.

**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`

**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GiB ingress VPS. Migrate when feasible; for now, hard `mem_limit` values contain the blast radius.

## CHELSTY-Specific Rules

- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
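
Scripts that target several nodes need to respect the Compose rule above. One possible sketch follows; `compose_cmd` is a hypothetical helper, and the assumption that every non-CHELSTY node has the v2 `docker compose` plugin goes beyond what this file states.

```shell
# Hypothetical helper: pick the Compose command for a node name.
# CHELSTY nodes run docker-compose v1; other nodes are assumed to have the v2 plugin.
compose_cmd() {
  case "$1" in
    chelsty-*|CHELSTY-*) echo "docker-compose" ;;
    *)                   echo "docker compose" ;;
  esac
}
```

Callers would rely on word splitting to expand the result, e.g. `$(compose_cmd chelsty-infra) up -d`.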

## Runtime Path Conventions

`/opt/homelab/` layout on each node:

- `data/<service>/` — persistent volumes
- `config/<service>/` — secrets and host-local overrides (not in Git)
- `logs/<service>/` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed

## Definition of Done (services)

Before any new or changed service is considered ready:

1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.

## Naming Conventions

- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always `restart: unless-stopped` unless `service.yaml` says otherwise

## Multi-agent worktree mode

`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.

If `.agent-task` exists in your current working directory, you are in a task worktree.
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.

Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
Agents never invoke these — only the human does.
README.md: 19 lines changed
@@ -13,22 +13,6 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
 | **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
 | **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
 
-## Agent System
-
-The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
-
-| Agent | Node | Role |
-|-------|------|------|
-| **stability-agent** | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
-| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
-| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
-| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
-| **planner-agent** | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
-| **executor** | VPS | Executes actions only after operator approval |
-| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
-
-Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
-
 ## Repository Structure
 
 - `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
@@ -45,13 +29,10 @@ Action approval flow: `pending/` → operator approves → `approved/` → execu
 ## Documentation Index
 
 - [Infrastructure Standards](docs/standards.md)
-- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
 - [Deployment Conventions](docs/deployment.md)
 - [Hardware](docs/hardware.md)
 - [Networking](docs/networking.md)
 - [Services](docs/services.md)
-- [Node Capabilities](docs/capabilities.md)
-- [Action Model](services/agent-system/action-model.md)
 
 ---
 *Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
@@ -1,31 +0,0 @@
{
  "metadata": {
    "format": "zigpy/open-coordinator-backup",
    "version": 1,
    "source": "zigbee-herdsman@10.0.7",
    "internal": {
      "date": "2026-05-14T14:48:35.098Z",
      "znpVersion": 1
    }
  },
  "stack_specific": {
    "zstack": {
      "tclk_seed": "32d69cbe3f0e15471e5d43f9401e485a"
    }
  },
  "coordinator_ieee": "00124b00257bf416",
  "pan_id": "46bc",
  "extended_pan_id": "087730b5f614ea4a",
  "nwk_update_id": 0,
  "security_level": 5,
  "channel": 11,
  "channel_mask": [
    11
  ],
  "network_key": {
    "key": "049909949a950d91522cf10cc369a724",
    "sequence_number": 0,
    "frame_counter": 0
  },
  "devices": []
}
@@ -1,49 +0,0 @@
# Agent Operating Procedures

This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.

## 1. Core Principles for Agents

1. **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
2. **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
3. **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
4. **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
5. **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.

## 2. Agent Roles

| Role | Responsibility | Scope |
|------|----------------|-------|
| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |

## 3. Discovery Protocol

Agents must use the following entry points to understand the system:

1. **Topology**: `inventory/topology.yaml` for node list and roles.
2. **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
3. **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
4. **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.

## 4. Interaction with Humans

Agents communicate with the operator via the `agent-system/telegram-bot`.

- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
- **Approval**: Agents must wait for the action status to transition to `approved` before execution.

## 5. Decision Logic (Reasoning)

When making decisions, agents MUST prioritize:
1. **Safety**: Do not violate power constraints (see `capabilities.yaml`).
2. **Stability**: Prefer keeping services on their `owner_node` unless it's down.
3. **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.

## 6. Access Control for Agents

- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.
@@ -83,10 +83,3 @@ Future autonomous agents will use this metadata to:
 2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
 3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
 4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
-
-## Agent Reasoning Logic
-
-When an agent parses `capabilities.yaml`, it should apply these heuristics:
-- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
-- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
-- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
@@ -1,154 +1,60 @@
 # CHELSTY Runtime
 
-This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
+This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
 
-| Node | Role | Services |
-|------|------|----------|
-| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
-| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
-
-Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
-
 ## Runtime Layout
 
-```
-/opt/homelab/
-├── config/                  # Service-specific configs and secrets (not in Git)
-│   ├── mosquitto/
-│   └── zigbee2mqtt/
-├── data/                    # Persistent service data
-│   ├── mosquitto/           # Persistence DB, password file
-│   └── zigbee2mqtt/
-│       └── data/            # z2m config, coordinator backup, network key
-└── logs/
-```
+The CHELSTY runtime is located at `/opt/homelab`.
+
+- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
+- `/opt/homelab/data/`: Persistent data for services.
+- `/opt/homelab/logs/`: Service logs.
+
+### Key Service Locations
+- **Mosquitto**: `/opt/homelab/config/mosquitto/`
+- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
 
 ## SLZB-06U Integration
 
-CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
+CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
 
-- **Coordinator IP**: `192.168.1.105`
-- **Port**: `6638`
-- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
-- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
+- **Coordinator IP**: 192.168.1.105
+- **Port**: 6638
+- **Protocol**: TCP (ezsp adapter)
 
-⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
+Zigbee2MQTT is configured to connect to this coordinator over the local network.
 
||||||
## Networking Constraints
|
## Offline & LTE Assumptions
|
||||||
|
|
||||||
### Mosquitto — `network_mode: host`
|
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
|
||||||
Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
|
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
|
||||||
|
- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.
|
||||||
### Zigbee2MQTT — bridge network + extra_hosts
|
|
||||||
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
|
|
||||||
services:
|
|
||||||
zigbee2mqtt:
|
|
||||||
extra_hosts:
|
|
||||||
- "mosquitto:host-gateway"
|
|
||||||
```
|
|
||||||
|
|
||||||
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
|
|
||||||
|
|
||||||
**Why not `network_mode: host` for z2m?**
|
|
||||||
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
|
|
||||||
|
|
||||||
## Zigbee2MQTT Config Location
|
|
||||||
|
|
||||||
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
|
|
||||||
|
|
||||||
```
|
|
||||||
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
|
||||||
```
|
|
||||||
|
|
||||||
This path is mounted read-write by the base `docker-compose.yml`:
|
|
||||||
```yaml
|
|
||||||
volumes:
|
|
||||||
- /opt/homelab/data/zigbee2mqtt/data:/app/data
|
|
||||||
```
|
|
||||||
|
|
||||||
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
|
|
||||||
|
|
||||||
### Minimal configuration.yaml
|
|
||||||
```yaml
|
|
||||||
homeassistant: true
|
|
||||||
permit_join: false
|
|
||||||
mqtt:
|
|
||||||
base_topic: zigbee2mqtt
|
|
||||||
server: mqtt://mosquitto:1883
|
|
||||||
serial:
|
|
||||||
port: tcp://192.168.1.105:6638
|
|
||||||
adapter: ezsp
|
|
||||||
frontend:
|
|
||||||
port: 8080
|
|
||||||
advanced:
|
|
||||||
log_level: info
|
|
||||||
```
|
|
||||||
|
|
||||||
## chelsty-ha — No node-agent
|
|
||||||
|
|
||||||
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
|
|
||||||
|
|
||||||
In `hosts/chelsty-ha/services.yaml`:
|
|
||||||
```yaml
|
|
||||||
services:
|
|
||||||
homeassistant:
|
|
||||||
monitor: false # No node-agent; suppresses supervisor action generation
|
|
||||||
```
|
|
||||||
|
|
||||||
Remove `monitor: false` once node-agent is bootstrapped on this VM.
|
|
||||||
|
|
||||||
## Deployment Flow
|
## Deployment Flow
|
||||||
|
|
||||||
### Initial Bootstrap
|
1. **Initial Bootstrap**:
|
||||||
```bash
|
Run the bootstrap script on the CHELSTY node:
|
||||||
./scripts/bootstrap/chelsty-runtime.sh
|
```bash
|
||||||
```
|
./scripts/bootstrap/chelsty-runtime.sh
|
||||||
|
```
|
||||||
|
|
||||||
### Deploy services
|
2. **Manual Configuration**:
|
||||||
```bash
|
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
|
||||||
./scripts/deploy/deploy-node.sh chelsty-infra
|
- Add Mosquitto user:
|
||||||
./scripts/deploy/deploy-node.sh chelsty-ha
|
```bash
|
||||||
```
|
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
|
||||||
|
```
|
||||||
|
|
||||||
### Manual (SSH) — chelsty-infra uses docker-compose v1
|
3. **Service Deployment**:
|
||||||
```bash
|
Use the staged deployment runtime:
|
||||||
ssh oskar@100.122.201.22
|
```bash
|
||||||
cd ~/homelab-codex-ws/services/<service>
|
./scripts/deploy/deploy-node.sh chelsty
|
||||||
docker-compose -f docker-compose.yml \
|
```
|
||||||
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
|
|
||||||
up -d --build --force-recreate
|
|
||||||
```
|
|
||||||
|
|
||||||
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
|
## Recovery Procedure
|
||||||
|
|
||||||
## Recovery Procedures
|
In case of runtime failure:
|
||||||
|
1. Verify Docker and Compose plugin: `docker compose version`
|
||||||
### Mosquitto stopped
|
2. Re-run bootstrap script to ensure directory structure and basic configs.
|
||||||
```bash
|
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
|
||||||
ssh oskar@100.122.201.22 "docker start mosquitto"
|
4. Verify SLZB-06U reachability: `ping 192.168.1.105`
|
||||||
# Ensure restart policy is correct:
|
|
||||||
docker update --restart unless-stopped mosquitto
|
|
||||||
```
|
|
||||||
|
|
||||||
### Zigbee2MQTT won't start
|
|
||||||
1. Check logs: `docker logs zigbee2mqtt --tail 50`
|
|
||||||
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
|
|
||||||
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
|
|
||||||
4. If config missing, recreate from the minimal template above
|
|
||||||
|
|
||||||
### SLZB-06U unreachable
|
|
||||||
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
|
|
||||||
|
|
||||||
## Critical Backup Sets
|
|
||||||
|
|
||||||
| Data | Path |
|
|
||||||
|------|------|
|
|
||||||
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
|
|
||||||
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
|
|
||||||
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
|
|
||||||
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
|
|
||||||
|
|
||||||
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
|
|
||||||
|
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
### CHELSTY Stability Agent
|
|
||||||
|
|
||||||
The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.
|
|
||||||
|
|
||||||
#### Purpose
|
|
||||||
|
|
||||||
It acts as a filesystem-first watchdog that detects anomalies in the local runtime environment without taking autonomous destructive actions (like restarts). It serves as the primary data source for node-level stability metrics.
|
|
||||||
|
|
||||||
#### Monitoring Scope
|
|
||||||
|
|
||||||
* **Docker Containers**: Monitors all local containers. If a container is not in the `running` state, a `containers_not_running` event is generated.
|
|
||||||
* **Disk Usage**: Monitors the root filesystem. Generates `disk_usage_high` events if usage exceeds the configured threshold.
|
|
||||||
* **Connectivity**:
|
|
||||||
* Checks if the Tailscale socket or interface is available.
|
|
||||||
* Checks reachability of the local Mosquitto MQTT broker.
|
|
||||||
* **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
|
|
||||||
|
|
||||||
#### Storage and Integration
|
|
||||||
|
|
||||||
* **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
|
|
||||||
* **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
|
|
||||||
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
|
|
||||||
|
|
||||||
#### Deployment
|
|
||||||
|
|
||||||
The service is deployed via Docker Compose on CHELSTY.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd services/stability-agent
|
|
||||||
docker compose up -d
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Configuration
|
|
||||||
|
|
||||||
Configuration is managed via environment variables in `docker-compose.override.yml` on the host.
|
|
||||||
|
|
||||||
| Variable | Description | Default |
|
|
||||||
|----------|-------------|---------|
|
|
||||||
| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
|
|
||||||
| `DISK_THRESHOLD_PCT` | Disk usage alert threshold | `90` |
|
|
||||||
| `MQTT_HOST` | MQTT broker hostname | `mosquitto` |
|
|
||||||
| `MQTT_PORT` | MQTT broker port | `1883` |
|
|
||||||
|
|
@ -1,98 +0,0 @@
|
||||||
# Observer Runtime
|
|
||||||
|
|
||||||
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
|
||||||
|
|
||||||
### Inputs
|
|
||||||
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
|
|
||||||
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
|
|
||||||
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
|
||||||
|
|
||||||
### World Model Output
|
|
||||||
Generated under `/opt/homelab/world/`:
|
|
||||||
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
|
|
||||||
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
|
|
||||||
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
|
||||||
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
|
||||||
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
|
||||||
|
|
||||||
## Checkpoint Format
|
|
||||||
|
|
||||||
The observer tracks per-node progress to avoid silently skipping event directories:
|
|
||||||
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"node_checkpoints": {
|
|
||||||
"vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
|
|
||||||
"piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
|
|
||||||
"chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
|
|
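A rough sketch of why per-node checkpoints avoid the skip. It assumes the `<date>/<node>/<event>.json` layout shown above and that event file names sort in processing order; `pending_events` is a hypothetical helper, not the observer's real function.

```python
import tempfile
from pathlib import Path


def pending_events(events_root: Path, node_checkpoints: dict[str, str]) -> list[Path]:
    """Return unprocessed event files, comparing each file to its own node's checkpoint."""
    pending: list[Path] = []
    for path in sorted(events_root.glob("*/*/*.json")):
        node = path.parent.name
        last_done = node_checkpoints.get(node)
        # A node with no checkpoint yet has everything pending.
        if last_done is None or str(path) > last_done:
            pending.append(path)
    return pending


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for node, name in [("piha", "evt-piha-1.json"), ("vps", "evt-vps-1.json")]:
        (root / "2026-05-27" / node).mkdir(parents=True)
        (root / "2026-05-27" / node / name).write_text("{}")
    # Only vps has been processed; a single global checkpoint pointing at
    # vps/ would have hidden the piha/ file, which sorts before it.
    checkpoints = {"vps": str(root / "2026-05-27" / "vps" / "evt-vps-1.json")}
    remaining = [p.name for p in pending_events(root, checkpoints)]
```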
||||||
|
|
||||||
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
|
|
||||||
|
|
||||||
## Event Types
|
|
||||||
|
|
||||||
### Negative events (create/escalate incidents)
|
|
||||||
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
|
|
||||||
- `deployment_failed` — record failure in deployments.json
|
|
||||||
|
|
||||||
### Positive events (resolve state)
|
|
||||||
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
|
|
||||||
- `service_recovered` — alias, same effect
|
|
||||||
- `deployment_completed` — marks deployment as completed
|
|
||||||
|
|
||||||
### Node events
|
|
||||||
- `node_online`, `node_offline` — update node status in nodes.json
|
|
||||||
- `disk_pressure_*` — set `disk_pressure` field on the node record
|
|
||||||
|
|
||||||
## Incident Lifecycle
|
|
||||||
|
|
||||||
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
|
|
||||||
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
|
|
||||||
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
|
|
||||||
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
|
|
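The detection, correlation, and resolution steps can be sketched as a single fold over events. This is an illustration under assumptions: the `"active"` status value, the event field names (`type`, `node`, `service`, `timestamp`), and the `apply_event` helper are hypothetical; only the incident fields and ID shape are taken from the example in this document.

```python
FAILURE_TYPES = {"service_unhealthy", "healthcheck_failed"}
RECOVERY_TYPES = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict, event: dict) -> None:
    """Fold one event into the incidents dict (detect, correlate, resolve)."""
    node, service, ts = event["node"], event["service"], event["timestamp"]
    # Assumes at most one active incident per node/service.
    active = next(
        (inc for inc in incidents.values()
         if inc["status"] == "active" and (inc["node"], inc["service"]) == (node, service)),
        None,
    )
    if event["type"] in FAILURE_TYPES:
        if active is None:
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service,
                "status": "active", "started_at": ts,
                "last_occurrence": ts, "occurrence_count": 1,
            }
        else:
            # Correlation: repeat failures collapse into the same incident.
            active["occurrence_count"] += 1
            active["last_occurrence"] = ts
    elif event["type"] in RECOVERY_TYPES and active is not None:
        active["status"] = "resolved"
        active["resolved_at"] = ts


incidents: dict = {}
for kind, ts in [("service_unhealthy", 1715518800.0),
                 ("healthcheck_failed", 1715518860.0),
                 ("service_healthy", 1715519100.0)]:
    apply_event(incidents, {"type": kind, "node": "vps", "service": "observer", "timestamp": ts})
```

Expiry after 7 days is left out here; it belongs to the pruning pass rather than to event handling.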
||||||
|
|
||||||
### Example Incident JSON
|
|
||||||
```json
|
|
||||||
{
|
|
||||||
"inc-1715518800-vps-observer": {
|
|
||||||
"id": "inc-1715518800-vps-observer",
|
|
||||||
"node": "vps",
|
|
||||||
"service": "observer",
|
|
||||||
"status": "resolved",
|
|
||||||
"severity": "error",
|
|
||||||
"started_at": 1715518800.0,
|
|
||||||
"last_occurrence": 1715518860.0,
|
|
||||||
"occurrence_count": 2,
|
|
||||||
"trigger_type": "containers_not_running",
|
|
||||||
"resolved_at": 1715519100.0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## World State Pruning
|
|
||||||
|
|
||||||
`_prune_stale_world()` runs every reconcile cycle and removes:
|
|
||||||
|
|
||||||
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
|
|
||||||
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
|
|
||||||
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
|
|
||||||
4. **Expired incidents** — resolved incidents older than 7 days.
|
|
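The ghost-key rule (item 3) might be expressed as a regular expression over the service-name part of a `node/service` key. A minimal sketch; `is_ghost_key` is a hypothetical name, and lowercase hex is assumed for the 12-character prefix.

```python
import re

# <12hexchars>_<name>, matched against the service portion only.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")


def is_ghost_key(service_key: str) -> bool:
    """True if the service part of a "node/service" key looks like a Docker stale-state name."""
    _node, _, service = service_key.partition("/")
    return bool(GHOST_SERVICE.match(service))


keys = ["vps/observer", "vps/0123456789ab_observer", "piha/redis"]
kept = [k for k in keys if not is_ghost_key(k)]
```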
||||||
|
|
||||||
## Runtime Behavior
|
|
||||||
|
|
||||||
### Idempotency
|
|
||||||
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
|
|
||||||
|
|
||||||
### Deployment Tracking
|
|
||||||
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
|
|
||||||
|
|
||||||
### Topology Filtering
|
|
||||||
World-state entries for nodes not listed in `inventory/topology.yaml` are removed during pruning. This prevents transient bootstrap noise from polluting world state.
|
|
||||||
|
|
@ -1,234 +0,0 @@
|
||||||
# SESSION: Building planner-agent (LLM-based diagnostics)
|
|
||||||
|
|
||||||
**DATE:** 2026-05-27
|
|
||||||
**RESULT:** planner-agent is running on SOLARIA (`healthy`), Ollama is primary, cloud fallback is ready to be enabled
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What was built
|
|
||||||
|
|
||||||
### `services/planner-agent/src/llm_router.py`
|
|
||||||
|
|
||||||
LLM routing module with a local-first fallback chain:
|
|
||||||
|
|
||||||
- **`LLMRouter`**: main routing class, built on litellm
|
|
||||||
- **`ModelConfig`**: configuration for a single model (name, timeout, api_base, extra_kwargs)
|
|
||||||
- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`), plus success_rate
|
|
||||||
- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
|
|
||||||
- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
|
|
||||||
- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with bare JSON
|
|
||||||
|
|
||||||
Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
|
|
||||||
|
|
||||||
Metrics for every call are published to the Redis channel `llm_router_metrics`.
|
|
||||||
|
|
||||||
### `services/planner-agent/src/planner.py`
|
|
||||||
|
|
||||||
Main agent loop:
|
|
||||||
|
|
||||||
- **`PlannerAgent`**: async agent (Redis sub → LLM diagnosis → pending action file → event)
|
|
||||||
- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
|
|
||||||
- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
|
|
||||||
- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT record a cooldown if the LLM failed
|
|
||||||
- **`parse_event()`**: normalizes the two input formats (node-agent / control-plane)
|
|
||||||
- **`write_pending_action()`**: atomic write (`.tmp` → rename)
|
|
||||||
- **`emit_event()`**: writes the `remediation_started` event to the filesystem (no imports from control-plane)
|
|
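The `.tmp` → rename pattern named for `write_pending_action()` can be sketched as follows. The signature, the `.json.tmp` suffix, and the `act-demo` identifier are assumptions for illustration; the real function may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def write_pending_action(pending_dir: Path, action_id: str, action: dict) -> Path:
    """Write an action file so readers never observe a partially written file."""
    pending_dir.mkdir(parents=True, exist_ok=True)
    final = pending_dir / f"{action_id}.json"
    tmp = pending_dir / f"{action_id}.json.tmp"
    tmp.write_text(json.dumps(action))
    # os.replace is atomic when source and destination are on the same filesystem.
    os.replace(tmp, final)
    return final


with tempfile.TemporaryDirectory() as tmp_root:
    pending = Path(tmp_root) / "actions" / "pending"
    written = write_pending_action(pending, "act-demo", {"action": "container_restart"})
    names = sorted(p.name for p in pending.iterdir())
    payload = json.loads(written.read_text())
```

Keeping the temporary file in the same directory as the final file is what makes the rename atomic; a temp file on another filesystem would degrade to copy-then-delete.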
||||||
|
|
||||||
Pipeline:
|
|
||||||
```
|
|
||||||
Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
|
|
||||||
→ write_pending_action() → emit_event("remediation_started")
|
|
||||||
```
|
|
||||||
|
|
||||||
### Companion files
|
|
||||||
|
|
||||||
| File | Description |
|
|
||||||
|------|------|
|
|
||||||
| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
|
|
||||||
| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
|
|
||||||
| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
|
|
||||||
| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
|
|
||||||
| `requirements.txt` | litellm, redis, jsonschema, structlog |
|
|
||||||
| `tests/test_planner.py` | 49 unit tests |
|
|
||||||
| `tests/test_llm_router.py` | 34 unit tests |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key architectural decisions
|
|
||||||
|
|
||||||
### 1. HITL invariant (Human-in-the-loop)
|
|
||||||
|
|
||||||
The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
|
|
||||||
The planner never executes an action on its own; this is a fundamental rule of the system.
|
|
||||||
|
|
||||||
Implementation: `write_pending_action()` writes to `pending/`, and no code path touches `approved/`.
|
|
||||||
|
|
||||||
### 2. Cooldown gate
|
|
||||||
|
|
||||||
Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with repeated
|
|
||||||
proposals for the same service.
|
|
||||||
|
|
||||||
**Key decision:** the cooldown is NOT recorded if the entire LLM chain failed.
|
|
||||||
This lets the next event retry instead of being silently blocked
|
|
||||||
for 5 minutes even though no proposal was produced.
|
|
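A minimal sketch of that gate, assuming hypothetical method names (`is_cooling_down`, `record`) and an injectable clock for testing; only the class name and the 5-minute default come from this document.

```python
import time
from typing import Callable


class CooldownTracker:
    """Per-service gate: suppress repeat proposals within the cooldown window."""

    def __init__(self, cooldown_s: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last: dict[str, float] = {}

    def is_cooling_down(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is not None and self._clock() - last < self._cooldown_s

    def record(self, svc_key: str) -> None:
        # Call this only after a proposal was actually written.
        self._last[svc_key] = self._clock()


now = [0.0]  # fake clock so the example is deterministic
tracker = CooldownTracker(clock=lambda: now[0])
key = "piha/mosquitto"

# The LLM chain failed: record() is deliberately not called, so a retry is allowed.
blocked_after_failure = tracker.is_cooling_down(key)

tracker.record(key)  # a proposal was produced at t=0
now[0] = 299.0
blocked_in_window = tracker.is_cooling_down(key)
now[0] = 300.0
blocked_after_window = tracker.is_cooling_down(key)
```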
||||||
|
|
||||||
### 3. Fallback chain — local-first
|
|
||||||
|
|
||||||
Order: Ollama (local GPU) → Haiku → Sonnet.
|
|
||||||
|
|
||||||
Rationale:
|
|
||||||
- Ollama sends no data to external services; low latency for simple cases
|
|
||||||
- Haiku = fast and cheap cloud fallback
|
|
||||||
- Sonnet = last resort for hard cases
|
|
||||||
|
|
||||||
A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
|
|
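The fallback behaviour can be illustrated without any LLM library. This sketch does not use litellm and is not the `LLMRouter` implementation: the models are plain hypothetical callables, and refusal-pattern and schema checks are omitted, leaving only timeout, network error, and invalid JSON as rejection reasons.

```python
import json
from typing import Callable

ModelCall = Callable[[str], str]


def route(prompt: str, chain: list[tuple[str, ModelCall]]) -> tuple[str, dict, list[str]]:
    """Try each model in order; fall through on timeout, network error, or invalid JSON."""
    attempts: list[str] = []
    for name, call in chain:
        try:
            parsed = json.loads(call(prompt))
        except (TimeoutError, ConnectionError, json.JSONDecodeError) as exc:
            attempts.append(f"{name}: {type(exc).__name__}")
            continue
        attempts.append(f"{name}: success")
        return name, parsed, attempts
    raise RuntimeError(f"all models failed: {attempts}")


def local_model(prompt: str) -> str:
    # Stand-in for the local model exceeding its time budget.
    raise TimeoutError("local model timed out")


def cloud_model(prompt: str) -> str:
    return '{"action": "container_restart"}'


model_used, proposal, attempts = route(
    "diagnose piha/mosquitto", [("local", local_model), ("cloud", cloud_model)]
)
```

Recording every attempt, including the failed ones, is what makes per-model success rates computable afterwards.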
||||||
|
|
||||||
### 4. No imports from control-plane
|
|
||||||
|
|
||||||
`services/planner-agent/` is fully self-contained. It imports nothing from
|
|
||||||
`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
|
|
||||||
`scripts/lib/events.py`).
|
|
||||||
|
|
||||||
Rationale: the planner must keep working even if control-plane is offline, and it has a separate
|
|
||||||
deployment cycle.
|
|
||||||
|
|
||||||
### 5. structlog with PrintLoggerFactory
|
|
||||||
|
|
||||||
We do not use `structlog.stdlib.add_logger_name` because `PrintLogger` has no `.name` attribute.
|
|
||||||
Instead, the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer`
|
|
||||||
→ `format_exc_info` → `JSONRenderer`.
|
|
||||||
|
|
||||||
### 6. NODE_NAME is read at call time, not at import time
|
|
||||||
|
|
||||||
`_emit_event_sync` reads `NODE_NAME` from the module-level `NODE_NAME` on every call
|
|
||||||
(not as a default parameter). This makes it patchable in tests.
|
|
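The difference matters because Python evaluates default parameter values once, when the function is defined. A small self-contained illustration (the two function names are hypothetical, not the planner's):

```python
NODE_NAME = "solaria"


def emit_bound_at_import(node: str = NODE_NAME) -> str:
    # The default was captured when the function was defined.
    return node


def emit_read_at_call() -> str:
    # The module-level name is looked up on every call.
    return NODE_NAME


# What a test patch of the module attribute effectively does:
NODE_NAME = "test-node"

bound = emit_bound_at_import()
live = emit_read_at_call()
```

Here `bound` still holds the original value while `live` sees the patched one, which is why the call-time lookup is the test-friendly choice.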
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Problems encountered and solutions
|
|
||||||
|
|
||||||
### Problem: `localhost` inside the container does not reach the host
|
|
||||||
|
|
||||||
**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container
|
|
||||||
on the default bridge network cannot reach the host via `localhost`.
|
|
||||||
|
|
||||||
**Solution:**
|
|
||||||
1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
|
|
||||||
2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
|
|
||||||
|
|
||||||
### Problem: `environment` vs `env_file` (duplicated variables)
|
|
||||||
|
|
||||||
**Context:** The first version of docker-compose.yml had all variables hardcoded
|
|
||||||
in the `environment` section with fallback values (`${VAR:-default}`). As a result,
|
|
||||||
`.env` was optional rather than required.
|
|
||||||
|
|
||||||
**Solution:** Removed all runtime variables from `environment` and moved them to `env_file`.
|
|
||||||
Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not be stored in a file on disk).
|
|
||||||
|
|
||||||
### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger
|
|
||||||
|
|
||||||
**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`
|
|
||||||
|
|
||||||
**Solution:** Removed `add_logger_name` from the processor chain. It is not
|
|
||||||
compatible with `PrintLoggerFactory`.
|
|
||||||
|
|
||||||
### Problem: verify stage fails right after startup
|
|
||||||
|
|
||||||
**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat file does not exist yet.
|
|
||||||
|
|
||||||
**Cause:** Race condition: the agent needs a few seconds to start
|
|
||||||
its loop and perform the first heartbeat `touch()`.
|
|
||||||
|
|
||||||
**Solution:** This is not a real failure. The Docker healthcheck has `start_period: 30s`.
|
|
||||||
The container shows `(healthy)` 30s after start.
|
|
||||||
|
|
||||||
### Problem: git pull with divergent branches on solaria
|
|
||||||
|
|
||||||
**Symptom:** Solaria had 2 local commits that were not on Forgejo, plus manual changes in the working tree.
|
|
||||||
`git pull` failed with "Need to specify how to reconcile divergent branches."
|
|
||||||
|
|
||||||
**Solution:**
|
|
||||||
```bash
|
|
||||||
git checkout -- services/planner-agent/docker-compose.yml # discard manual changes
|
|
||||||
git fetch origin
|
|
||||||
git rebase origin/master # rebase local commits on top of master
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Deployment status on SOLARIA
|
|
||||||
|
|
||||||
```
|
|
||||||
Container: planner-agent Up ~30m (healthy)
|
|
||||||
Image: planner-agent-planner-agent
|
|
||||||
Node: solaria (100.100.231.104)
|
|
||||||
Heartbeat: /opt/homelab/state/planner-agent.heartbeat (age 0s)
|
|
||||||
|
|
||||||
Channels subscribed:
|
|
||||||
- health_events
|
|
||||||
- world_updates
|
|
||||||
|
|
||||||
LLM chain:
|
|
||||||
PRIMARY: ollama/qwen2.5-coder:14b @ http://host-gateway:11434
|
|
||||||
FALLBACK: claude-haiku-4-5-20251001 (disabled: no ANTHROPIC_API_KEY)
|
|
||||||
FALLBACK: claude-sonnet-4-6 (disabled: no ANTHROPIC_API_KEY)
|
|
||||||
|
|
||||||
Redis: redis://100.108.208.3:6379 ✓ connected
|
|
||||||
```
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Left for later
|
|
||||||
|
|
||||||
### 1. ANTHROPIC_API_KEY: cloud fallback disabled
|
|
||||||
|
|
||||||
Haiku and Sonnet are configured in the chain but have no API key.
|
|
||||||
When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.
|
|
||||||
|
|
||||||
To enable:
|
|
||||||
```bash
|
|
||||||
ssh oskar@100.100.231.104
|
|
||||||
echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
|
|
||||||
docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. End-to-end test with a real event
|
|
||||||
|
|
||||||
The planner is connected to Redis and listening, but no event has yet
|
|
||||||
gone through the full path (LLM call → pending action → operator UI).
|
|
||||||
|
|
||||||
Test:
|
|
||||||
```bash
|
|
||||||
redis-cli -h 100.108.208.3 PUBLISH health_events '{
|
|
||||||
"type": "service_unhealthy",
|
|
||||||
"node": "piha",
|
|
||||||
"service": "mosquitto",
|
|
||||||
"severity": "error",
|
|
||||||
"payload": {"reason": "container exited"},
|
|
||||||
"timestamp": "2026-05-27T20:00:00Z"
|
|
||||||
}'
|
|
||||||
# Watch: docker logs planner-agent -f
|
|
||||||
# Check: ls /opt/homelab/actions/pending/
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. Solaria local commits
|
|
||||||
|
|
||||||
Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`)
|
|
||||||
that are not on Forgejo. They were rebased on top of master but not pushed.
|
|
||||||
They should be pushed, or reviewed and possibly squashed.
|
|
||||||
|
|
||||||
### 4. Integration with operator UI / Telegram
|
|
||||||
|
|
||||||
Proposals in `actions/pending/` do not yet have a notification channel to the operator.
|
|
||||||
The Telegram bot should send a notification when a new file appears in `pending/`.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Commits from this session
|
|
||||||
|
|
||||||
```
|
|
||||||
ff6fda1 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
|
|
||||||
ca37fca Add planner-agent: LLM-powered remediation planner
|
|
||||||
(llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
|
|
||||||
healthcheck.sh, Dockerfile)
|
|
||||||
```
|
|
||||||
|
|
@ -1,103 +0,0 @@
|
||||||
# SESSION: Stabilizing the homelab multi-agent system
|
|
||||||
|
|
||||||
**DATE:** 2026-05-27
|
|
||||||
**RESULT:** System NOMINAL (97/97 services, 0 errors)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## PROBLEMS FOUND
|
|
||||||
|
|
||||||
- stability-agent did not generate remediation actions: only redeploy, no container_restart
|
|
||||||
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
|
|
||||||
- zigbee2mqtt had never been deployed on chelsty-infra
|
|
||||||
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
|
|
||||||
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
|
|
||||||
- the materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
|
|
||||||
- observer used a single global checkpoint instead of per-node checkpoints, silently skipping event directories that sort before the current checkpoint
|
|
||||||
- supervisor did not cancel resolved actions, so the pending queue grew without bound
|
|
||||||
- the `service_healthy` event did not close active incidents
|
|
||||||
- NODE_ALIAS_MAP was not configured, causing a node-name mismatch between events and topology
|
|
||||||
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## FIXES SHIPPED (commits in master)
|
|
||||||
|
|
||||||
```
|
|
||||||
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
|
|
||||||
b40b832 Fix ghost service keys from hash-prefixed Docker container names
|
|
||||||
28e9534 observer: service_healthy resolves active incidents
|
|
||||||
46ae92b supervisor: also cancel pending actions for services removed from desired state
|
|
||||||
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
|
|
||||||
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
|
|
||||||
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
|
|
||||||
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
|
|
||||||
fb7828b supervisor: auto-cancel pending actions when drift is resolved
|
|
||||||
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
|
|
||||||
267742c vps/node-agent: add network_mode: host for control-plane health probe
|
|
||||||
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
|
|
||||||
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
|
|
||||||
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
|
|
||||||
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
|
|
||||||
65bac4e fix(node-agent): mount host SSH key into container for event shipping
|
|
||||||
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
|
|
||||||
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
|
|
||||||
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
|
|
||||||
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
|
|
||||||
```
|
|
||||||
|
|
||||||
### Details of key fixes
|
|
||||||
|
|
||||||
**fix(observer): per-node checkpoints**
|
|
||||||
A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dict `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
|
|
||||||
|
|
||||||
**fix(observer): ghost key pruning**
|
|
||||||
`_prune_stale_world()` now removes services.json entries whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker internal state tracking.
|
|
||||||
|
|
||||||
**fix(node-agent): canonical container name**
|
|
||||||
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
|
|
||||||
|
|
||||||
**fix(node-agent): service_healthy emission**
|
|
||||||
Node-agent now emits `service_healthy` for every running managed container each cycle. Without it, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
|
|
||||||
|
|
||||||
**fix(supervisor): auto-cancel resolved actions**
|
|
||||||
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
|
|
||||||
- the service became healthy (`drift_resolved_auto`)
|
|
||||||
- the service was removed from desired state (`service_removed_from_desired_state`)
|
|
||||||
|
|
||||||
**fix(supervisor): monitor:false**
|
|
||||||
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
|
|
||||||
|
|
||||||
**fix(agent-system/materializer): control-plane API as source**
|
|
||||||
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of the local Redis. Redis contained 80 stale entries. The Redis path is kept as a fallback.
|
|
||||||
|
|
||||||
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
|
|
||||||
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Fix: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
|
|
||||||
|
|
||||||
**fix(chelsty-infra/zigbee2mqtt): writable config**
|
|
||||||
z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## FINAL STATE
|
|
||||||
|
|
||||||
| Node | Status | Services |
|
|
||||||
|------|--------|---------|
|
|
||||||
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
|
|
||||||
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
|
|
||||||
| solaria | online | node-agent, stability-agent, AI workloads |
|
|
||||||
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
|
|
||||||
| chelsty-ha | - | homeassistant (monitor:false: no node-agent, HA is monitored indirectly via MQTT) |
|
|
||||||
|
|
||||||
**Action queue:** 0 pending, 0 approved, 0 running
|
|
||||||
**Incidents:** 0 active
|
|
||||||
**Ghost service keys:** 0
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## KNOWN LIMITATIONS / TODO
|
|
||||||
|
|
||||||
- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
|
|
||||||
- The `ezsp` adapter in the z2m configuration is deprecated; migrating to `ember` is recommended. This needs no new configuration, only changing the field to `adapter: ember` in `configuration.yaml`.
|
|
||||||
- chelsty-ha has no node-agent. Add one when a machine is available or via manual bootstrap.
|
|
||||||
- Redis on piha still holds old `homelab:nodes:*`, `homelab:incidents:*` etc. keys. They are no longer used by the materializer in API mode and can be cleaned up.
|
|
||||||
|
|
@ -1,62 +0,0 @@
|
||||||
# Stability Agent Multi-Node Rollout
|
|
||||||
|
|
||||||
## Architecture Summary
|
|
||||||
The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
|
|
||||||
|
|
||||||
- **Source**: `services/stability-agent`
|
|
||||||
- **State Path**: `/opt/homelab/state`
|
|
||||||
- **Events Path**: `/opt/homelab/events`
|
|
||||||
- **Redis Target**: `100.108.208.3:6379` (PIHA)
|
|
||||||
|
|
||||||
## Why the UI only showed CHELSTY
|
|
||||||
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
|
|
||||||
|
|
||||||
## Deployment
|
|
||||||
|
|
||||||
Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Print commands
|
|
||||||
./scripts/deploy/deploy-stability-agent.sh <node-name>
|
|
||||||
|
|
||||||
# Deploy via SSH (executes ssh oskar@<ip>)
|
|
||||||
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
|
|
||||||
```
|
|
||||||
|
|
||||||
### Manual Steps per Node
|
|
||||||
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
|
|
||||||
```bash
|
|
||||||
cd /home/oskar/homelab-codex-ws
|
|
||||||
git fetch origin
|
|
||||||
git checkout master
|
|
||||||
git pull origin master
|
|
||||||
cd services/stability-agent
|
|
||||||
./deploy-local.sh <node-name>
|
|
||||||
```
|
|
||||||
|
|
||||||
## Verification
|
|
||||||
|
|
||||||
### Fleet Overview
|
|
||||||
Run the verification script from any node with `redis-cli` access:
|
|
||||||
```bash
|
|
||||||
./scripts/deploy/verify-agent-fleet.sh
|
|
||||||
```
|
|
||||||
|
|
||||||
### Redis Inspection (on PIHA)
|
|
||||||
```bash
|
|
||||||
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
|
|
||||||
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
|
|
||||||
```
|
|
||||||
|
|
||||||
Verify Web UI backend:
|
|
||||||
```bash
|
|
||||||
curl -s http://127.0.0.1:18180/nodes
|
|
||||||
curl -k https://agents.okit.pl/nodes
|
|
||||||
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
|
|
||||||
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
|
|
||||||
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
|
|
||||||
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.
|
|
||||||
|
|
@ -49,10 +49,9 @@ Runtime state must live outside the repository to keep it immutable and clean.
|
||||||
## Service Standards
|
## Service Standards
|
||||||
|
|
||||||
1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
|
1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
|
||||||
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
|
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
|
||||||
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
|
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
|
||||||
4. **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
|
4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
|
||||||
5. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.
|
|
||||||
|
|
||||||
## Docker Compose Standards
|
## Docker Compose Standards
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -1,126 +0,0 @@
|
||||||
# VPS Control Plane
|
|
||||||
|
|
||||||
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
|
|
||||||
|
|
||||||
| Container | Role |
|
|
||||||
|-----------|------|
|
|
||||||
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
|
|
||||||
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
|
|
||||||
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
|
|
||||||
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
|
|
||||||
|
|
||||||
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
|
|
||||||
|
|
||||||
## Supervisor Behavior
|
|
||||||
|
|
||||||
### Desired State
|
|
||||||
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
|
|
||||||
|
|
||||||
### Drift Types
|
|
||||||
- `missing_service` — service is in desired state but absent from `services.json`
|
|
||||||
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
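
The two drift types, together with the `monitor: false` skip described under Desired State, can be sketched as follows. This is an illustration under assumptions, not the supervisor's code: desired and actual state are modelled as dicts keyed by `(node, service)` tuples, and `detect_drift` is a hypothetical name.

```python
def detect_drift(desired, actual):
    """Return (node, service, drift_type) tuples for every drifted service.

    desired: {(node, service): spec} as loaded from hosts/*/services.yaml
    actual:  {(node, service): {"status": ...}} as found in services.json
    """
    drift = []
    for (node, service), spec in sorted(desired.items()):
        if spec.get("monitor") is False:
            continue  # unmonitored services never produce drift
        state = actual.get((node, service))
        if state is None:
            drift.append((node, service, "missing_service"))
        elif state.get("status") != "healthy":
            drift.append((node, service, "unhealthy_service"))
    return drift
```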
|
|
||||||
|
|
||||||
### Action Types
|
|
||||||
| Trigger | Action type | Risk |
|
|
||||||
|---------|-------------|------|
|
|
||||||
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
|
|
||||||
| Any other / unknown | `redeploy` | guarded |
|
|
||||||
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
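
Read as code, the table above might look like the sketch below. The split into a per-service and a per-node function is an assumption made for illustration; `service_action` and `node_action` are hypothetical names.

```python
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def service_action(trigger):
    """Map a service-level trigger to (action_type, risk)."""
    if trigger in RESTART_TRIGGERS:
        return ("container_restart", "low")
    # Any other or unknown trigger falls back to a guarded redeploy.
    return ("redeploy", "guarded")


def node_action(disk_pressure):
    """Map a node's disk_pressure value to (action_type, risk), or None."""
    if disk_pressure == "high":
        return ("disk_cleanup", "guarded")
    return None
```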
|
|
||||||
|
|
||||||
### Action ID Stability
|
|
||||||
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
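
A small sketch of why a deterministic ID gives idempotence: the same drift always maps to the same filename, so a second reconcile pass finds the file already present. The helper name `action_id` and the idea of deriving the prefix from the action type are assumptions.

```python
def action_id(action_type, node, service):
    """Build a stable ID such as 'container-restart-piha-redis'."""
    prefix = action_type.replace("_", "-")
    return f"{prefix}-{node}-{service}"


def pending_filename(action_type, node, service):
    """Filename used under actions/pending/ for this drift."""
    return action_id(action_type, node, service) + ".json"
```

Because the ID carries no timestamp or random suffix, restarting the supervisor cannot create a second pending file for the same drift.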
|
|
||||||
|
|
||||||
### Auto-Cancel
|
|
||||||
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
|
|
||||||
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
|
|
||||||
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
|
|
||||||
|
|
||||||
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
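
The auto-cancel rules can be summarised in a sketch like the one below. It assumes each action is a dict with `state`, `type`, `node` and `service` keys and that state is keyed by `(node, service)`; those shapes and the name `cancel_reason` are illustrative, not taken from the supervisor source.

```python
CANCELLABLE_TYPES = {"redeploy", "container_restart"}


def cancel_reason(action, desired, actual):
    """Return the cancel reason for an action, or None to leave it alone."""
    # Approved and running actions are never cancelled automatically.
    if action["state"] != "pending" or action["type"] not in CANCELLABLE_TYPES:
        return None
    key = (action["node"], action["service"])
    spec = desired.get(key)
    if spec is None or spec.get("monitor") is False:
        return "service_removed_from_desired_state"
    if actual.get(key, {}).get("status") == "healthy":
        return "drift_resolved_auto"
    return None
```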
|
|
||||||
|
|
||||||
### Node Name Resolution
|
|
||||||
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
|
|
||||||
|
|
||||||
```bash
|
|
||||||
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
|
|
||||||
```
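
For illustration, a JSON-string variable like this could be consumed as sketched below. The fallback to an empty map on missing or invalid JSON is an assumption, and `load_alias_map` and `canonical_node` are hypothetical names.

```python
import json
import os


def load_alias_map(env=None):
    """Parse NODE_ALIAS_MAP into a dict; return {} if unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("NODE_ALIAS_MAP", "")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def canonical_node(name, alias_map):
    """Translate an event node name to its canonical topology name."""
    return alias_map.get(name, name)
```

Names without an alias pass through unchanged, so nodes that already report their canonical name need no entry in the map.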
|
|
||||||
|
|
||||||
## Deployment
|
|
||||||
|
|
||||||
### From SATURN (primary control node)
|
|
||||||
```bash
|
|
||||||
# Full deploy via SSH
|
|
||||||
./scripts/deploy/deploy-control-plane.sh --ssh
|
|
||||||
|
|
||||||
# Or manually:
|
|
||||||
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Direct on VPS
|
|
||||||
```bash
|
|
||||||
cd ~/homelab-codex-ws/services/control-plane
|
|
||||||
docker compose up -d --build --force-recreate
|
|
||||||
```
|
|
||||||
|
|
||||||
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
|
|
||||||
|
|
||||||
### Verification
|
|
||||||
```bash
|
|
||||||
# On VPS
|
|
||||||
docker ps --filter "name=control-plane"
|
|
||||||
curl -s http://localhost:18180/summary | python3 -m json.tool
|
|
||||||
```
|
|
||||||
|
|
||||||
## Action Approval Workflow
|
|
||||||
|
|
||||||
```
|
|
||||||
Supervisor writes → /opt/homelab/actions/pending/<id>.json
|
|
||||||
→ Operator UI (port 18180) or Telegram Bot notifies
|
|
||||||
→ Operator clicks Approve
|
|
||||||
→ /opt/homelab/actions/approved/<id>.json
|
|
||||||
→ Executor executes → completed / failed
|
|
||||||
```
|
|
||||||
|
|
||||||
Possible action states: `pending → approved → running → completed / failed / rejected`
|
|
||||||
Auto-cancel path: `pending → cancelled/`
|
|
||||||
|
|
||||||
## Recovery
|
|
||||||
|
|
||||||
### World state is stale or corrupt
|
|
||||||
```bash
|
|
||||||
# On VPS — delete checkpoint to force full replay
|
|
||||||
rm /opt/homelab/state/observer_checkpoint.json
|
|
||||||
docker restart control-plane-observer
|
|
||||||
```
|
|
||||||
|
|
||||||
### Flood of pending actions after bootstrap
|
|
||||||
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Check node-agent on each node
|
|
||||||
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Rebuild from scratch
|
|
||||||
```bash
|
|
||||||
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Integration
|
|
||||||
|
|
||||||
### piha agent-system webui (port 18180 on piha)
|
|
||||||
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
|
|
||||||
|
|
||||||
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
|
|
||||||
|
|
||||||
### Nginx Proxy Manager
|
|
||||||
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
|
|
||||||
|
|
||||||
### Log Locations
|
|
||||||
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
|
|
||||||
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
|
|
||||||
- World state: `/opt/homelab/world/`
|
|
||||||
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
|
|
||||||
|
|
@ -1,24 +0,0 @@
|
||||||
host: chelsty-ha
|
|
||||||
site: chelsty
|
|
||||||
|
|
||||||
capabilities:
|
|
||||||
networking:
|
|
||||||
reachability: tailscale-only
|
|
||||||
tailscale_ip: 100.122.201.23
|
|
||||||
ingress_suitability: false
|
|
||||||
bandwidth: LTE
|
|
||||||
|
|
||||||
runtime:
|
|
||||||
container_engine: docker
|
|
||||||
os: debian
|
|
||||||
|
|
||||||
operational:
|
|
||||||
connectivity: intermittent
|
|
||||||
availability_target: best-effort
|
|
||||||
offline_first: true
|
|
||||||
uplink: lte
|
|
||||||
|
|
||||||
deployment:
|
|
||||||
suitability:
|
|
||||||
- homeassistant
|
|
||||||
restricted: false
|
|
||||||
|
|
@ -1,20 +0,0 @@
|
||||||
hostname: chelsty-ha
|
|
||||||
site: chelsty
|
|
||||||
|
|
||||||
roles:
|
|
||||||
- homeassistant
|
|
||||||
|
|
||||||
network:
|
|
||||||
tailscale_ip: 100.122.201.23
|
|
||||||
|
|
||||||
runtime:
|
|
||||||
root: /opt/homelab
|
|
||||||
|
|
||||||
deployment:
|
|
||||||
mode: pull
|
|
||||||
managed_by: saturn
|
|
||||||
|
|
||||||
constraints:
|
|
||||||
connectivity:
|
|
||||||
intermittent: true
|
|
||||||
uplink: lte
|
|
||||||
|
|
@ -1,12 +0,0 @@
|
||||||
host: chelsty-ha
|
|
||||||
site: chelsty
|
|
||||||
|
|
||||||
services:
|
|
||||||
homeassistant:
|
|
||||||
role: home-automation-controller
|
|
||||||
offline_required: true
|
|
||||||
# monitor: false — chelsty-ha has no node-agent deployed, so there are no
|
|
||||||
# container-health events for the observer to track. HA is monitored
|
|
||||||
# indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
|
|
||||||
# is likely down). Re-enable once node-agent is bootstrapped on this VM.
|
|
||||||
monitor: false
|
|
||||||
|
|
@ -1,88 +0,0 @@
|
||||||
# Frigate NVR — chelsty-infra
|
|
||||||
# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
|
|
||||||
# Object detection: CPU (no Coral TPU)
|
|
||||||
# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
|
|
||||||
#
|
|
||||||
# Required env vars in /opt/homelab/config/frigate/frigate.env:
|
|
||||||
# CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
|
|
||||||
# CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
|
|
||||||
# MQTT_USER, MQTT_PASS (if mosquitto auth is enabled)
|
|
||||||
|
|
||||||
mqtt:
|
|
||||||
enabled: true
|
|
||||||
host: 127.0.0.1
|
|
||||||
port: 1883
|
|
||||||
# user: "{MQTT_USER}"
|
|
||||||
# password: "{MQTT_PASS}"
|
|
||||||
|
|
||||||
detectors:
|
|
||||||
cpu1:
|
|
||||||
type: cpu
|
|
||||||
num_threads: 3
|
|
||||||
|
|
||||||
ffmpeg:
|
|
||||||
hwaccel_args: preset-vaapi
|
|
||||||
global_args:
|
|
||||||
- -hide_banner
|
|
||||||
- -loglevel
|
|
||||||
- warning
|
|
||||||
|
|
||||||
record:
|
|
||||||
enabled: true
|
|
||||||
retain:
|
|
||||||
days: 7
|
|
||||||
mode: all
|
|
||||||
events:
|
|
||||||
retain:
|
|
||||||
default: 14
|
|
||||||
mode: motion
|
|
||||||
|
|
||||||
snapshots:
|
|
||||||
enabled: true
|
|
||||||
retain:
|
|
||||||
default: 7
|
|
||||||
quality: 70
|
|
||||||
|
|
||||||
objects:
|
|
||||||
track:
|
|
||||||
- person
|
|
||||||
- car
|
|
||||||
- bicycle
|
|
||||||
filters:
|
|
||||||
person:
|
|
||||||
min_area: 5000
|
|
||||||
max_area: 100000
|
|
||||||
threshold: 0.7
|
|
||||||
|
|
||||||
cameras:
|
|
||||||
camera1:
|
|
||||||
ffmpeg:
|
|
||||||
inputs:
|
|
||||||
# Main stream — high-res recording
|
|
||||||
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
|
|
||||||
roles:
|
|
||||||
- record
|
|
||||||
# Sub stream — low-res detection (lower CPU cost)
|
|
||||||
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
|
|
||||||
roles:
|
|
||||||
- detect
|
|
||||||
detect:
|
|
||||||
enabled: true
|
|
||||||
width: 640
|
|
||||||
height: 480
|
|
||||||
fps: 5
|
|
||||||
|
|
||||||
camera2:
|
|
||||||
ffmpeg:
|
|
||||||
inputs:
|
|
||||||
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
|
|
||||||
roles:
|
|
||||||
- record
|
|
||||||
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
|
|
||||||
roles:
|
|
||||||
- detect
|
|
||||||
detect:
|
|
||||||
enabled: true
|
|
||||||
width: 640
|
|
||||||
height: 480
|
|
||||||
fps: 5
|
|
||||||
|
|
@ -1,25 +0,0 @@
|
||||||
services:
|
|
||||||
frigate:
|
|
||||||
container_name: frigate
|
|
||||||
image: ghcr.io/blakeblackshear/frigate:stable
|
|
||||||
restart: unless-stopped
|
|
||||||
privileged: true
|
|
||||||
shm_size: "256mb"
|
|
||||||
network_mode: host
|
|
||||||
devices:
|
|
||||||
- /dev/dri/renderD128:/dev/dri/renderD128
|
|
||||||
volumes:
|
|
||||||
- /etc/localtime:/etc/localtime:ro
|
|
||||||
- /opt/homelab/config/frigate/config.yml:/config/config.yml
|
|
||||||
- /opt/homelab/config/frigate:/config/credentials:ro
|
|
||||||
- /opt/homelab/data/frigate:/media/frigate
|
|
||||||
tmpfs:
|
|
||||||
- /tmp/cache
|
|
||||||
env_file:
|
|
||||||
- /opt/homelab/config/frigate/frigate.env
|
|
||||||
healthcheck:
|
|
||||||
test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
|
|
||||||
interval: 30s
|
|
||||||
timeout: 10s
|
|
||||||
retries: 3
|
|
||||||
start_period: 60s
|
|
||||||
|
|
@ -1,11 +0,0 @@
|
||||||
services:
|
|
||||||
node-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=chelsty-infra
|
|
||||||
- NODE_TYPE=lte_node
|
|
||||||
- VPS_EVENTS_HOST=100.95.58.48
|
|
||||||
- VPS_EVENTS_USER=oskar
|
|
||||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
|
||||||
- CHECK_INTERVAL=60
|
|
||||||
volumes:
|
|
||||||
- /home/oskar/.ssh:/root/.ssh:ro
|
|
||||||
|
|
@ -1,12 +0,0 @@
|
||||||
services:
|
|
||||||
stability-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=chelsty-infra
|
|
||||||
- SITE_NAME=chelsty
|
|
||||||
- REDIS_HOST=100.108.208.3
|
|
||||||
- REDIS_PORT=6379
|
|
||||||
- REDIS_ENABLED=true
|
|
||||||
- STABILITY_CHECK_INTERVAL=60
|
|
||||||
- DISK_THRESHOLD_PCT=85
|
|
||||||
- MQTT_HOST=mosquitto
|
|
||||||
- MQTT_PORT=1883
|
|
||||||
|
|
@ -1,21 +0,0 @@
|
||||||
services:
|
|
||||||
zigbee2mqtt:
|
|
||||||
# mosquitto runs with network_mode: host on chelsty-infra.
|
|
||||||
# extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
|
|
||||||
# mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
|
|
||||||
# mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
|
|
||||||
extra_hosts:
|
|
||||||
- "mosquitto:host-gateway"
|
|
||||||
environment:
|
|
||||||
- TZ=Europe/Warsaw
|
|
||||||
healthcheck:
|
|
||||||
test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
|
|
||||||
interval: 30s
|
|
||||||
timeout: 10s
|
|
||||||
retries: 3
|
|
||||||
start_period: 90s
|
|
||||||
# Note: volumes NOT overridden here.
|
|
||||||
# The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
|
|
||||||
# (read-write). configuration.yaml must be placed in that directory on the node:
|
|
||||||
# /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
|
||||||
# z2m rewrites this file during migrations — read-only mount is not viable.
|
|
||||||
|
|
@ -1,37 +0,0 @@
|
||||||
host: chelsty-infra
|
|
||||||
site: chelsty
|
|
||||||
|
|
||||||
services:
|
|
||||||
ha-diag-agent:
|
|
||||||
role: ha-diagnostic-agent
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: false
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: [homeassistant]
|
|
||||||
config:
|
|
||||||
target_url: http://100.70.180.90:8123 # chelsty-ha via Tailscale (HAOS, separate VM)
|
|
||||||
location_tag: "chelsty"
|
|
||||||
events_dir: /opt/homelab/events/chelsty-infra
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/ha-diag-agent
|
|
||||||
data_path: /var/lib/ha-diag-agent
|
|
||||||
|
|
||||||
node-agent:
|
|
||||||
role: node-stability-monitor
|
|
||||||
# LTE node: node-agent monitors and emits events but does NO Docker cleanup.
|
|
||||||
# Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
|
|
||||||
# own retain policy is the correct remediation, not docker prune.
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
|
|
||||||
mosquitto:
|
|
||||||
role: local-mqtt-broker
|
|
||||||
|
|
||||||
zigbee2mqtt:
|
|
||||||
role: zigbee-mqtt-bridge
|
|
||||||
|
|
||||||
frigate:
|
|
||||||
role: nvr
|
|
||||||
|
|
@ -1,6 +1,3 @@
|
||||||
host: chelsty-infra
|
|
||||||
site: chelsty
|
|
||||||
|
|
||||||
capabilities:
|
capabilities:
|
||||||
hardware:
|
hardware:
|
||||||
cpu:
|
cpu:
|
||||||
|
|
@ -34,11 +31,10 @@ capabilities:
|
||||||
power_constraint: low-power
|
power_constraint: low-power
|
||||||
connectivity: intermittent
|
connectivity: intermittent
|
||||||
availability_target: best-effort
|
availability_target: best-effort
|
||||||
offline_operation_required: true
|
|
||||||
|
|
||||||
deployment:
|
deployment:
|
||||||
suitability:
|
suitability:
|
||||||
- staging
|
- staging
|
||||||
- infra
|
- homeassistant
|
||||||
- edge
|
- edge
|
||||||
restricted: false
|
restricted: false
|
||||||
|
|
@ -1,10 +1,9 @@
|
||||||
hostname: chelsty-infra
|
hostname: chelsty
|
||||||
site: chelsty
|
|
||||||
|
|
||||||
roles:
|
roles:
|
||||||
- edge
|
- edge
|
||||||
- hypervisor
|
- hypervisor
|
||||||
- infra
|
- homeassistant
|
||||||
- staging
|
- staging
|
||||||
|
|
||||||
network:
|
network:
|
||||||
|
|
@ -1,4 +1,4 @@
|
||||||
host: chelsty-infra
|
host: chelsty
|
||||||
|
|
||||||
uplink:
|
uplink:
|
||||||
type: lte
|
type: lte
|
||||||
|
|
@ -20,7 +20,7 @@ exposure_classes:
|
||||||
|
|
||||||
networks:
|
networks:
|
||||||
home_automation_lan:
|
home_automation_lan:
|
||||||
purpose: MQTT broker, Zigbee coordinator, and local device control.
|
purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
|
||||||
offline_required: true
|
offline_required: true
|
||||||
internet_required_for_core_operation: false
|
internet_required_for_core_operation: false
|
||||||
|
|
||||||
|
|
@ -1,4 +1,4 @@
|
||||||
host: chelsty-infra
|
host: chelsty
|
||||||
|
|
||||||
runtime_root: /opt/homelab
|
runtime_root: /opt/homelab
|
||||||
|
|
||||||
|
|
@ -9,6 +9,12 @@ conventions:
|
||||||
logs: /opt/homelab/logs
|
logs: /opt/homelab/logs
|
||||||
|
|
||||||
services:
|
services:
|
||||||
|
homeassistant:
|
||||||
|
data: /opt/homelab/data/homeassistant
|
||||||
|
config: /opt/homelab/config/homeassistant
|
||||||
|
logs: /opt/homelab/logs/homeassistant
|
||||||
|
backup_priority: critical
|
||||||
|
|
||||||
zigbee2mqtt:
|
zigbee2mqtt:
|
||||||
data: /opt/homelab/data/zigbee2mqtt
|
data: /opt/homelab/data/zigbee2mqtt
|
||||||
config: /opt/homelab/config/zigbee2mqtt
|
config: /opt/homelab/config/zigbee2mqtt
|
||||||
|
|
@ -21,13 +27,13 @@ services:
|
||||||
logs: /opt/homelab/logs/mosquitto
|
logs: /opt/homelab/logs/mosquitto
|
||||||
backup_priority: high
|
backup_priority: high
|
||||||
|
|
||||||
stability-agent:
|
|
||||||
data: /opt/homelab/state
|
|
||||||
config: /opt/homelab/config/stability-agent
|
|
||||||
logs: /opt/homelab/events
|
|
||||||
backup_priority: low
|
|
||||||
|
|
||||||
backup_sets:
|
backup_sets:
|
||||||
|
homeassistant:
|
||||||
|
include:
|
||||||
|
- /opt/homelab/config/homeassistant
|
||||||
|
- /opt/homelab/data/homeassistant
|
||||||
|
restore_note: Restore before starting the Home Assistant container.
|
||||||
|
|
||||||
zigbee2mqtt:
|
zigbee2mqtt:
|
||||||
include:
|
include:
|
||||||
- /opt/homelab/config/zigbee2mqtt
|
- /opt/homelab/config/zigbee2mqtt
|
||||||
|
|
@ -0,0 +1,13 @@
|
||||||
|
services:
|
||||||
|
zigbee2mqtt:
|
||||||
|
volumes:
|
||||||
|
- ./configuration.yaml:/app/data/configuration.yaml:ro
|
||||||
|
environment:
|
||||||
|
- MQTT_USER=${MQTT_USER}
|
||||||
|
- MQTT_PASSWORD=${MQTT_PASSWORD}
|
||||||
|
# Healthcheck is already defined in base service, but we ensure compatibility
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "curl", "-f", "http://localhost:8080"]
|
||||||
|
interval: 10s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
108
hosts/chelsty/services.yaml
Normal file
|
|
@ -0,0 +1,108 @@
|
||||||
|
host: chelsty
|
||||||
|
|
||||||
|
exposure_classes:
|
||||||
|
local-only:
|
||||||
|
description: Reachable only from CHELSTY-local networks or container networks.
|
||||||
|
public_ingress: false
|
||||||
|
tailscale_required: false
|
||||||
|
tailscale-internal:
|
||||||
|
description: Reachable through the Tailscale mesh by approved tailnet clients.
|
||||||
|
public_ingress: false
|
||||||
|
tailscale_required: true
|
||||||
|
public:
|
||||||
|
description: Reachable from the public internet through an explicit ingress path.
|
||||||
|
public_ingress: true
|
||||||
|
tailscale_required: false
|
||||||
|
|
||||||
|
operational_constraints:
|
||||||
|
uplink: lte
|
||||||
|
connectivity: intermittent
|
||||||
|
offline_operation_required: true
|
||||||
|
must_not_depend_on:
|
||||||
|
- saturn
|
||||||
|
- vps
|
||||||
|
- forgejo
|
||||||
|
|
||||||
|
services:
|
||||||
|
homeassistant:
|
||||||
|
role: home-automation-controller
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: tailscale-internal
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local:
|
||||||
|
- mosquitto
|
||||||
|
- zigbee2mqtt
|
||||||
|
external: []
|
||||||
|
ports:
|
||||||
|
- name: http
|
||||||
|
container_port: 8123
|
||||||
|
protocol: tcp
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/homeassistant
|
||||||
|
data_path: /opt/homelab/data/homeassistant
|
||||||
|
logs_path: /opt/homelab/logs/homeassistant
|
||||||
|
backup:
|
||||||
|
recommended: true
|
||||||
|
include:
|
||||||
|
- /opt/homelab/config/homeassistant
|
||||||
|
- /opt/homelab/data/homeassistant
|
||||||
|
notes:
|
||||||
|
- Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
|
||||||
|
- Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
|
||||||
|
|
||||||
|
zigbee2mqtt:
|
||||||
|
role: zigbee-mqtt-bridge
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local:
|
||||||
|
- mosquitto
|
||||||
|
external:
|
||||||
|
- slzb-06u
|
||||||
|
coordinator:
|
||||||
|
name: slzb-06u
|
||||||
|
connection: network
|
||||||
|
usb_device: null
|
||||||
|
ports:
|
||||||
|
- name: frontend
|
||||||
|
container_port: 8080
|
||||||
|
protocol: tcp
|
||||||
|
exposure: tailscale-internal
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/zigbee2mqtt
|
||||||
|
data_path: /opt/homelab/data/zigbee2mqtt
|
||||||
|
logs_path: /opt/homelab/logs/zigbee2mqtt
|
||||||
|
backup:
|
||||||
|
recommended: true
|
||||||
|
include:
|
||||||
|
- /opt/homelab/config/zigbee2mqtt
|
||||||
|
- /opt/homelab/data/zigbee2mqtt
|
||||||
|
notes:
|
||||||
|
- Include configuration.yaml, database.db, coordinator backup files, and network key material.
|
||||||
|
- Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
|
||||||
|
|
||||||
|
mosquitto:
|
||||||
|
role: local-mqtt-broker
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: []
|
||||||
|
ports:
|
||||||
|
- name: mqtt
|
||||||
|
container_port: 1883
|
||||||
|
protocol: tcp
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/mosquitto
|
||||||
|
data_path: /opt/homelab/data/mosquitto
|
||||||
|
logs_path: /opt/homelab/logs/mosquitto
|
||||||
|
backup:
|
||||||
|
recommended: true
|
||||||
|
include:
|
||||||
|
- /opt/homelab/config/mosquitto
|
||||||
|
- /opt/homelab/data/mosquitto
|
||||||
|
notes:
|
||||||
|
- Retain ACL, password, persistence, and bridge configuration if enabled.
|
||||||
|
|
@ -1,8 +0,0 @@
|
||||||
services:
|
|
||||||
runtime-materializer:
|
|
||||||
environment:
|
|
||||||
# Pull world state from the VPS control-plane API instead of local Redis.
|
|
||||||
# The observer on VPS is the authoritative writer; mirroring its API output
|
|
||||||
# here ensures the webui /snapshot matches the clean 97-service state that
|
|
||||||
# the control-plane /summary endpoint serves.
|
|
||||||
CONTROL_PLANE_URL: "http://100.95.58.48:18180"
|
|
||||||
|
|
@ -1,4 +0,0 @@
|
||||||
services:
|
|
||||||
brain-watchdog:
|
|
||||||
mem_limit: 64m
|
|
||||||
restart: unless-stopped
|
|
||||||
|
|
@ -1,11 +0,0 @@
|
||||||
services:
|
|
||||||
node-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=piha
|
|
||||||
- NODE_TYPE=sd_card
|
|
||||||
- VPS_EVENTS_HOST=100.95.58.48
|
|
||||||
- VPS_EVENTS_USER=oskar
|
|
||||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
|
||||||
- CHECK_INTERVAL=60
|
|
||||||
volumes:
|
|
||||||
- /home/oskar/.ssh:/root/.ssh:ro
|
|
||||||
|
|
@ -1,7 +0,0 @@
|
||||||
services:
|
|
||||||
stability-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=piha
|
|
||||||
- REDIS_HOST=100.108.208.3
|
|
||||||
- REDIS_PORT=6379
|
|
||||||
- REDIS_ENABLED=true
|
|
||||||
|
|
@ -1,42 +0,0 @@
|
||||||
host: piha
|
|
||||||
|
|
||||||
services:
|
|
||||||
ha-diag-agent:
|
|
||||||
role: ha-diagnostic-agent
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: false
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: [homeassistant]
|
|
||||||
config:
|
|
||||||
target_url: http://localhost:8123
|
|
||||||
location_tag: "ken"
|
|
||||||
events_dir: /opt/homelab/events/piha
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/ha-diag-agent
|
|
||||||
data_path: /var/lib/ha-diag-agent
|
|
||||||
|
|
||||||
node-agent:
|
|
||||||
role: node-stability-monitor
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: []
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/node-agent
|
|
||||||
data_path: /opt/homelab/state
|
|
||||||
logs_path: /opt/homelab/events
|
|
||||||
|
|
||||||
brain-watchdog:
|
|
||||||
role: control-plane-watchdog
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: private
|
|
||||||
offline_required: false
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: [control-plane]
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/brain-watchdog
|
|
||||||
|
|
@ -1,11 +0,0 @@
|
||||||
services:
|
|
||||||
node-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=solaria
|
|
||||||
- NODE_TYPE=ai_node
|
|
||||||
- VPS_EVENTS_HOST=100.95.58.48
|
|
||||||
- VPS_EVENTS_USER=oskar
|
|
||||||
- VPS_EVENTS_PATH=/opt/homelab/events
|
|
||||||
- CHECK_INTERVAL=60
|
|
||||||
volumes:
|
|
||||||
- /home/oskar/.ssh:/root/.ssh:ro
|
|
||||||
|
|
@ -1,7 +0,0 @@
|
||||||
services:
|
|
||||||
stability-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=solaria
|
|
||||||
- REDIS_HOST=100.108.208.3
|
|
||||||
- REDIS_PORT=6379
|
|
||||||
- REDIS_ENABLED=true
|
|
||||||
|
|
@ -1,15 +0,0 @@
|
||||||
host: solaria
|
|
||||||
|
|
||||||
services:
|
|
||||||
node-agent:
|
|
||||||
role: node-stability-monitor
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: []
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/node-agent
|
|
||||||
data_path: /opt/homelab/state
|
|
||||||
logs_path: /opt/homelab/events
|
|
||||||
|
|
@ -1,39 +0,0 @@
|
||||||
# Control-plane production overrides for the VPS deployment.
|
|
||||||
#
|
|
||||||
# NODE_ALIAS_MAP translates the node names that appear in raw event files
|
|
||||||
# (written by node agents / seed scripts) to the canonical names used in
|
|
||||||
# inventory/topology.yaml and hosts/*/services.yaml.
|
|
||||||
#
|
|
||||||
# Current live mapping (from /opt/homelab/events/ inspection):
|
|
||||||
# node-2 → chelsty (zigbee2mqtt / mosquitto / homeassistant node)
|
|
||||||
#
|
|
||||||
# Add further entries when new nodes come online and their event-source names
|
|
||||||
# differ from their topology names. Format is a single-line JSON object, e.g.:
|
|
||||||
# NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
|
|
||||||
#
|
|
||||||
# The executor inherits the canonical name from the action JSON written by the
|
|
||||||
# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
|
|
||||||
#
|
|
||||||
# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 ensures the
|
|
||||||
# host kernel OOM-killer never targets control-plane containers. mem_limit
|
|
||||||
# provides a per-container cgroup ceiling so a leaking process is restarted by
|
|
||||||
# Docker before it can exhaust host memory.
|
|
||||||
|
|
||||||
services:
|
|
||||||
operator-ui:
|
|
||||||
mem_limit: 192m
|
|
||||||
oom_score_adj: -900
|
|
||||||
|
|
||||||
observer:
|
|
||||||
mem_limit: 192m
|
|
||||||
oom_score_adj: -900
|
|
||||||
|
|
||||||
supervisor:
|
|
||||||
mem_limit: 400m
|
|
||||||
oom_score_adj: -900
|
|
||||||
environment:
|
|
||||||
- NODE_ALIAS_MAP={"node-2":"chelsty"}
|
|
||||||
|
|
||||||
executor:
|
|
||||||
mem_limit: 64m
|
|
||||||
oom_score_adj: -900
|
|
||||||
|
|
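Note on the override above: the header comment describes `NODE_ALIAS_MAP` as a single-line JSON object mapping event-source node names to topology names. A minimal sketch of how a consumer might resolve names from it is shown below. The helper names `load_alias_map` and `canonical_node` are hypothetical and not taken from the supervisor code, and ignoring a malformed map (rather than failing) is an assumption.

```python
import json
import os


def load_alias_map(raw=None):
    """Parse NODE_ALIAS_MAP (a single-line JSON object) into a dict.

    Hypothetical helper: the real supervisor may parse this differently.
    """
    if raw is None:
        raw = os.environ.get("NODE_ALIAS_MAP", "")
    if not raw.strip():
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: a malformed map is ignored instead of aborting startup.
        return {}
    if not isinstance(parsed, dict):
        return {}
    return {str(key): str(value) for key, value in parsed.items()}


def canonical_node(name, alias_map):
    """Return the topology name for an event-source name, or the name unchanged."""
    return alias_map.get(name, name)


aliases = load_alias_map('{"node-2":"chelsty"}')
print(canonical_node("node-2", aliases))
print(canonical_node("piha", aliases))
```

Names without an entry pass through unchanged, which matches the comment's advice to add entries only when an event-source name differs from its topology name.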
@ -1,7 +0,0 @@
|
||||||
# Control Plane Environment Variables
|
|
||||||
PORT=8080
|
|
||||||
HOMELAB_STATE_ROOT=/opt/homelab/state
|
|
||||||
HOMELAB_EVENTS_ROOT=/opt/homelab/events
|
|
||||||
HOMELAB_WORLD_ROOT=/opt/homelab/world
|
|
||||||
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
|
|
||||||
HOMELAB_CONFIG_ROOT=/opt/homelab/config
|
|
||||||
|
|
@ -1,16 +0,0 @@
|
||||||
services:
|
|
||||||
node-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=vps
|
|
||||||
- CHECK_INTERVAL=60
|
|
||||||
# host network mode: node-agent on VPS shares the host's network namespace
|
|
||||||
# so that localhost:18180 resolves to the control-plane's exposed port.
|
|
||||||
# Without this, localhost inside the container is the container's own loopback
|
|
||||||
# and the _check_control_plane_health() probe would always fail.
|
|
||||||
network_mode: host
|
|
||||||
# HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
|
|
||||||
# and may accumulate Python RSS over hours; 640m cap ensures it is killed and
|
|
||||||
# auto-restarted by Docker before consuming host memory. oom_score_adj -900
|
|
||||||
# prevents the host kernel OOM-killer from picking it as a global victim.
|
|
||||||
mem_limit: 640m
|
|
||||||
oom_score_adj: -900
|
|
||||||
|
|
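Note on the override above: the comment says the health probe targets `localhost:18180`, which only reaches the control-plane when the container shares the host network namespace. A rough sketch of such a probe follows. The function name `control_plane_healthy` is hypothetical, and the use of the `/summary` path and an HTTP 200 check are assumptions; the actual `_check_control_plane_health()` implementation is not shown in this diff.

```python
import urllib.error
import urllib.request


def control_plane_healthy(base_url="http://localhost:18180", timeout=3.0):
    """Return True if GET <base_url>/summary answers with HTTP 200.

    Hypothetical stand-in for the node-agent probe; endpoint and status
    check are assumptions.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/summary", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a malformed URL.
        return False
```

Inside a bridge-networked container the default `base_url` would point at the container's own loopback, so this probe would return `False` even when the control-plane is up; `network_mode: host` avoids that.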
@ -1,9 +0,0 @@
|
||||||
services:
|
|
||||||
stability-agent:
|
|
||||||
environment:
|
|
||||||
- NODE_NAME=vps
|
|
||||||
- REDIS_HOST=100.108.208.3
|
|
||||||
- REDIS_PORT=6379
|
|
||||||
- REDIS_ENABLED=true
|
|
||||||
mem_limit: 96m
|
|
||||||
oom_score_adj: -900
|
|
||||||
1
hosts/vps/services.txt
Normal file
|
|
@ -0,0 +1 @@
|
||||||
|
npm
|
||||||
|
|
@ -1,43 +0,0 @@
|
||||||
host: vps
|
|
||||||
|
|
||||||
services:
|
|
||||||
node-agent:
|
|
||||||
role: node-stability-monitor
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: []
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/node-agent
|
|
||||||
data_path: /opt/homelab/state
|
|
||||||
logs_path: /opt/homelab/events
|
|
||||||
|
|
||||||
control-plane:
|
|
||||||
role: management-and-orchestration
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: tailscale-internal
|
|
||||||
offline_required: false
|
|
||||||
depends_on:
|
|
||||||
local:
|
|
||||||
- node-agent
|
|
||||||
external:
|
|
||||||
- piha:redis
|
|
||||||
ports:
|
|
||||||
- name: http
|
|
||||||
container_port: 18180
|
|
||||||
protocol: tcp
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/control-plane
|
|
||||||
data_path: /opt/homelab/data/control-plane
|
|
||||||
logs_path: /opt/homelab/logs/control-plane
|
|
||||||
|
|
||||||
node_exporter:
|
|
||||||
role: metrics-exporter
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: []
|
|
||||||
|
|
@ -17,10 +17,6 @@ nodes:
|
||||||
roles:
|
roles:
|
||||||
- infra
|
- infra
|
||||||
- monitoring
|
- monitoring
|
||||||
services:
|
|
||||||
- node-agent
|
|
||||||
- ha-diag-agent
|
|
||||||
- brain-watchdog
|
|
||||||
|
|
||||||
solaria:
|
solaria:
|
||||||
roles:
|
roles:
|
||||||
|
|
@ -31,25 +27,12 @@ nodes:
|
||||||
roles:
|
roles:
|
||||||
- edge
|
- edge
|
||||||
- ingress
|
- ingress
|
||||||
- control-plane
|
|
||||||
services:
|
|
||||||
# Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
|
|
||||||
- node-agent
|
|
||||||
- control-plane # executor, observer, supervisor, operator-ui
|
|
||||||
- node_exporter
|
|
||||||
- stability-agent
|
|
||||||
- npm # Nginx Proxy Manager — public ingress, TLS termination
|
|
||||||
- outline # Team wiki (outline + postgres + redis)
|
|
||||||
- joplin # Note sync server (joplin-server + postgres)
|
|
||||||
- ai-cluster # AI workers: codex-worker, openclaw, planner-worker,
|
|
||||||
# service-ops-worker, redis, mosquitto
|
|
||||||
|
|
||||||
chelsty-infra:
|
chelsty:
|
||||||
site: chelsty
|
|
||||||
roles:
|
roles:
|
||||||
- remote
|
- remote
|
||||||
- hypervisor
|
- hypervisor
|
||||||
- infra
|
- homeassistant
|
||||||
- staging
|
- staging
|
||||||
connectivity:
|
connectivity:
|
||||||
uplink: lte
|
uplink: lte
|
||||||
|
|
@ -57,22 +40,10 @@ nodes:
|
||||||
home_automation:
|
home_automation:
|
||||||
offline_operation_required: true
|
offline_operation_required: true
|
||||||
services:
|
services:
|
||||||
|
- homeassistant
|
||||||
- zigbee2mqtt
|
- zigbee2mqtt
|
||||||
- mosquitto
|
- mosquitto
|
||||||
coordinator:
|
coordinator:
|
||||||
model: SLZB-06U
|
model: SLZB-06U
|
||||||
connection: network
|
connection: network
|
||||||
usb: false
|
usb: false
|
||||||
|
|
||||||
chelsty-ha:
|
|
||||||
site: chelsty
|
|
||||||
roles:
|
|
||||||
- remote
|
|
||||||
- homeassistant
|
|
||||||
connectivity:
|
|
||||||
uplink: lte
|
|
||||||
intermittent: true
|
|
||||||
home_automation:
|
|
||||||
offline_operation_required: true
|
|
||||||
services:
|
|
||||||
- homeassistant
|
|
||||||
|
|
|
||||||
|
|
@ -1,75 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
# vps-control-plane.sh - Bootstrap script for VPS control plane
|
|
||||||
|
|
||||||
set -e
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
|
||||||
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
|
|
||||||
RUNTIME_DIR="/opt/homelab"
|
|
||||||
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
|
|
||||||
|
|
||||||
# Colors for output
|
|
||||||
RED='\033[0;31m'
|
|
||||||
GREEN='\033[0;32m'
|
|
||||||
YELLOW='\033[1;33m'
|
|
||||||
NC='\033[0m' # No Color
|
|
||||||
|
|
||||||
log() { echo -e "${GREEN}[INFO]${NC} $1"; }
|
|
||||||
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
|
|
||||||
error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
|
|
||||||
|
|
||||||
log "Starting VPS control plane bootstrap..."
|
|
||||||
|
|
||||||
# 1. Validate Docker availability
|
|
||||||
if ! command -v docker &> /dev/null; then
|
|
||||||
error "Docker is not installed. Please install Docker first."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# 2. Validate compose plugin
|
|
||||||
if ! docker compose version &> /dev/null; then
|
|
||||||
error "Docker Compose plugin is not installed."
|
|
||||||
fi
|
|
||||||
|
|
||||||
log "Docker and Compose plugin verified."
|
|
||||||
|
|
||||||
# 3. Create filesystem-first runtime structure
|
|
||||||
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
|
|
||||||
sudo mkdir -p "$RUNTIME_DIR/events" \
|
|
||||||
"$RUNTIME_DIR/state" \
|
|
||||||
"$RUNTIME_DIR/world" \
|
|
||||||
"$RUNTIME_DIR/actions/pending" \
|
|
||||||
"$RUNTIME_DIR/actions/approved" \
|
|
||||||
"$RUNTIME_DIR/actions/running" \
|
|
||||||
"$RUNTIME_DIR/actions/completed" \
|
|
||||||
"$RUNTIME_DIR/actions/failed" \
|
|
||||||
"$RUNTIME_DIR/actions/rejected" \
|
|
||||||
"$RUNTIME_DIR/config" \
|
|
||||||
"$RUNTIME_DIR/logs"
|
|
||||||
|
|
||||||
# 4. Set permissions
|
|
||||||
log "Setting permissions..."
|
|
||||||
sudo chown -R $USER:$USER "$RUNTIME_DIR"
|
|
||||||
chmod -R 755 "$RUNTIME_DIR"
|
|
||||||
|
|
||||||
# 5. Install environment file
|
|
||||||
log "Installing environment configuration..."
|
|
||||||
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
|
|
||||||
cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
|
|
||||||
log "Created $RUNTIME_DIR/config/control-plane.env from template."
|
|
||||||
else
|
|
||||||
warn "Environment file already exists, skipping installation."
|
|
||||||
fi
|
|
||||||
|
|
||||||
# 6. Build and start the control plane
|
|
||||||
log "Building and starting control plane services..."
|
|
||||||
cd "$REPO_ROOT/services/control-plane"
|
|
||||||
docker compose build
|
|
||||||
docker compose up -d
|
|
||||||
|
|
||||||
log "VPS control plane bootstrap complete!"
|
|
||||||
|
|
||||||
echo -e "\n${YELLOW}Verification commands:${NC}"
|
|
||||||
echo "1. Check container status: docker compose ps"
|
|
||||||
echo "2. Check operator UI: curl http://localhost:8080/summary"
|
|
||||||
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
|
|
||||||
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
|
|
||||||
|
|
@ -1,23 +0,0 @@
|
||||||
#!/bin/bash
|
|
||||||
# scripts/deploy/deploy-control-plane.sh
|
|
||||||
set -e
|
|
||||||
|
|
||||||
VPS_IP="100.95.58.48"
|
|
||||||
USER="oskar"
|
|
||||||
REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
|
|
||||||
|
|
||||||
MODE=$1
|
|
||||||
|
|
||||||
case "$MODE" in
|
|
||||||
"--ssh")
|
|
||||||
echo "Deploying to VPS ($VPS_IP) via SSH..."
|
|
||||||
ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
|
|
||||||
;;
|
|
||||||
"--print")
|
|
||||||
echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
|
|
||||||
;;
|
|
||||||
*)
|
|
||||||
echo "Usage: $0 [--ssh|--print]"
|
|
||||||
exit 1
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
|
|
@ -1,26 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
|
|
||||||
|
|
||||||
MODE="print"
|
|
||||||
[[ "$1" == "--ssh" ]] && MODE="ssh"
|
|
||||||
|
|
||||||
TARGET="100.122.201.22"
|
|
||||||
NODE="chelsty-infra"
|
|
||||||
REPO_PATH="/home/oskar/homelab-codex-ws"
|
|
||||||
SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
|
|
||||||
|
|
||||||
echo "HOST: $NODE"
|
|
||||||
echo "MODE: $MODE"
|
|
||||||
echo "TARGET: $TARGET"
|
|
||||||
|
|
||||||
# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
|
|
||||||
# before first deploy. See config.yml for required variables.
|
|
||||||
DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker-compose pull && docker-compose up -d"
|
|
||||||
|
|
||||||
if [[ "$MODE" == "ssh" ]]; then
|
|
||||||
echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
|
|
||||||
ssh oskar@$TARGET "$DEPLOY_CMD"
|
|
||||||
else
|
|
||||||
echo "# --- Deployment commands for $NODE ---"
|
|
||||||
echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
|
|
||||||
fi
|
|
||||||
|
|
@ -8,7 +8,6 @@ set -e
|
||||||
REPO_PATH="${HOME}/homelab-codex-ws"
|
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||||
RUNTIME_PATH="/opt/homelab"
|
RUNTIME_PATH="/opt/homelab"
|
||||||
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
|
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
|
||||||
HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"
|
|
||||||
|
|
||||||
echo "--- Starting Deployment on ${HOSTNAME} ---"
|
echo "--- Starting Deployment on ${HOSTNAME} ---"
|
||||||
|
|
||||||
|
|
@ -23,33 +22,20 @@ echo "Pulling latest changes..."
|
||||||
git pull
|
git pull
|
||||||
|
|
||||||
# 2. Identify Services
|
# 2. Identify Services
|
||||||
SERVICES=()
|
# Based on our convention, we look for services assigned to this host
|
||||||
if [ -f "${HOST_DIR}/services.txt" ]; then
|
# For now, we'll check if a 'services.txt' exists in the host folder
|
||||||
mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
|
SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
|
||||||
elif [ -f "${HOST_DIR}/services.yaml" ]; then
|
|
||||||
SERVICES=($(python3 -c "
|
|
||||||
import yaml, sys
|
|
||||||
try:
|
|
||||||
with open('${HOST_DIR}/services.yaml', 'r') as f:
|
|
||||||
data = yaml.safe_load(f)
|
|
||||||
if data and 'services' in data:
|
|
||||||
if isinstance(data['services'], dict):
|
|
||||||
print(' '.join(data['services'].keys()))
|
|
||||||
elif isinstance(data['services'], list):
|
|
||||||
print(' '.join(data['services']))
|
|
||||||
except Exception as e:
|
|
||||||
print(f'Error parsing YAML: {e}', file=sys.stderr)
|
|
||||||
sys.exit(1)
|
|
||||||
"))
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [ ${#SERVICES[@]} -eq 0 ]; then
|
if [ ! -f "$SERVICE_LIST" ]; then
|
||||||
echo "No services found for ${HOSTNAME}. Skipping service deployment."
|
echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
|
||||||
exit 0
|
exit 0
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# 3. Deploy Services
|
# 3. Deploy Services
|
||||||
for service in "${SERVICES[@]}"; do
|
while IFS= read -r service || [ -n "$service" ]; do
|
||||||
|
[[ "$service" =~ ^#.*$ ]] && continue # Skip comments
|
||||||
|
[[ -z "$service" ]] && continue # Skip empty lines
|
||||||
|
|
||||||
echo "Deploying service: ${service}..."
|
echo "Deploying service: ${service}..."
|
||||||
|
|
||||||
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
|
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
|
||||||
|
|
@ -59,10 +45,13 @@ for service in "${SERVICES[@]}"; do
|
||||||
continue
|
continue
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
# Target directory in runtime
|
||||||
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
|
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
|
||||||
mkdir -p "$TARGET_DIR"
|
mkdir -p "$TARGET_DIR"
|
||||||
|
|
||||||
OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
|
# We use the compose file from the repo directly
|
||||||
|
# but we can also handle overrides here
|
||||||
|
OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
|
||||||
|
|
||||||
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
|
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
|
||||||
if [ -f "$OVERRIDE_FILE" ]; then
|
if [ -f "$OVERRIDE_FILE" ]; then
|
||||||
|
|
@ -71,6 +60,7 @@ for service in "${SERVICES[@]}"; do
|
||||||
fi
|
fi
|
||||||
|
|
||||||
$COMPOSE_CMD up -d --remove-orphans
|
$COMPOSE_CMD up -d --remove-orphans
|
||||||
done
|
|
||||||
|
done < "$SERVICE_LIST"
|
||||||
|
|
||||||
echo "--- Deployment Complete ---"
|
echo "--- Deployment Complete ---"
|
||||||
|
|
|
||||||
|
|
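Note on the service-discovery change above: both sides of the diff read `services.txt` and skip comments and empty lines before deploying each entry. The filtering rule can be sketched as below. The function name `parse_service_list` is hypothetical, and stripping surrounding whitespace is an assumption that is slightly more lenient than the `^#.*$` check in the shell loop.

```python
def parse_service_list(text):
    """Return service names from services.txt-style text.

    Blank lines and lines starting with '#' (after stripping whitespace)
    are skipped. Hypothetical helper for illustration only.
    """
    services = []
    for line in text.splitlines():
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        services.append(name)
    return services


print(parse_service_list("# VPS services\n\nnpm\n"))
```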
@ -1,55 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
|
|
||||||
|
|
||||||
NODE=$1
|
|
||||||
MODE="print"
|
|
||||||
[[ "$2" == "--ssh" ]] && MODE="ssh"
|
|
||||||
|
|
||||||
if [[ -z "$NODE" ]]; then
|
|
||||||
echo "Usage: $0 <node-name> [--ssh]"
|
|
||||||
echo "Supported nodes: chelsty, piha, solaria, vps"
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
case "$NODE" in
|
|
||||||
piha) TARGET="100.108.208.3" ;;
|
|
||||||
chelsty) TARGET="100.122.201.22" ;;
|
|
||||||
vps) TARGET="100.95.58.48" ;;
|
|
||||||
solaria) TARGET="local" ;;
|
|
||||||
*)
|
|
||||||
echo "Error: Unknown node '$NODE'"
|
|
||||||
echo "Supported nodes: chelsty, piha, solaria, vps"
|
|
||||||
exit 1
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
|
|
||||||
echo "HOST: $NODE"
|
|
||||||
echo "MODE: $MODE"
|
|
||||||
echo "TARGET: $TARGET"
|
|
||||||
|
|
||||||
REPO_PATH="/home/oskar/homelab-codex-ws"
|
|
||||||
|
|
||||||
if [[ "$NODE" == "solaria" ]]; then
|
|
||||||
if [[ "$MODE" == "ssh" ]]; then
|
|
||||||
echo "--- Running local deployment for solaria ---"
|
|
||||||
cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
|
|
||||||
else
|
|
||||||
echo "# --- Deployment commands for solaria ---"
|
|
||||||
echo "cd $REPO_PATH"
|
|
||||||
echo "git fetch origin"
|
|
||||||
echo "git checkout master"
|
|
||||||
echo "git pull origin master"
|
|
||||||
echo "cd services/stability-agent"
|
|
||||||
echo "./deploy-local.sh solaria"
|
|
||||||
fi
|
|
||||||
else
|
|
||||||
# Remote nodes
|
|
||||||
SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
|
|
||||||
if [[ "$MODE" == "ssh" ]]; then
|
|
||||||
echo "--- Deploying to $NODE ($TARGET) via SSH ---"
|
|
||||||
eval "$SSH_CMD"
|
|
||||||
else
|
|
||||||
echo "# --- Deployment commands for $NODE ---"
|
|
||||||
echo "$SSH_CMD"
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
@ -1,321 +1,270 @@
|
||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
|
# deploy.sh - Staged deployment framework for homelab nodes.
|
||||||
# Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
|
||||||
# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
|
|
||||||
# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
|
||||||
|
|
||||||
set -uo pipefail
|
set -o pipefail
|
||||||
|
|
||||||
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
# --- Configuration ---
|
||||||
SSH_USER="${SSH_USER:-oskar}"
|
export RUNTIME_PATH="/opt/homelab"
|
||||||
START_TIME=$(date +%s)
|
export STATE_DIR="${RUNTIME_PATH}/state/deploy"
|
||||||
TARGET=""
|
export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
|
||||||
DRY_RUN=false
|
export REPO_PATH="${HOME}/homelab-codex-ws"
|
||||||
NO_GATE=false
|
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
|
||||||
|
export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
|
||||||
|
|
||||||
usage() {
|
# --- Initialization ---
|
||||||
cat >&2 <<'EOF'
|
mkdir -p "$STATE_DIR" "$LOG_DIR"
|
||||||
Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
|
||||||
|
|
||||||
Targets:
|
# Redirection for logging
|
||||||
control-plane observer/supervisor/executor/operator-ui on VPS
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
vps all VPS GitOps services
|
|
||||||
piha PIHA services
|
|
||||||
solaria SOLARIA compute services
|
|
||||||
chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)
|
|
||||||
|
|
||||||
Flags:
|
# --- Load Libraries ---
|
||||||
--dry-run run preflight + gate only; stop before deploy
|
LIB_PATH="${REPO_PATH}/scripts/lib"
|
||||||
--no-gate skip pytest + docker build (emergency only; logged as WARNING)
|
source "${LIB_PATH}/log.sh"
|
||||||
|
source "${LIB_PATH}/state.sh"
|
||||||
|
source "${LIB_PATH}/inventory.sh"
|
||||||
|
source "${LIB_PATH}/compose.sh"
|
||||||
|
source "${LIB_PATH}/diagnostics.sh"
|
||||||
|
|
||||||
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
# --- CLI Parsing ---
|
||||||
EOF
|
TARGET_HOST=$(hostname)
|
||||||
exit 1
|
TARGET_SERVICE=""
|
||||||
}
|
RESUME=false
|
||||||
|
REQUESTED_STAGE=""
|
||||||
|
|
||||||
while [[ $# -gt 0 ]]; do
|
while [[ $# -gt 0 ]]; do
|
||||||
case $1 in
|
case $1 in
|
||||||
control-plane|vps|piha|solaria|chelsty-infra)
|
--host)
|
||||||
TARGET="$1"; shift ;;
|
TARGET_HOST="$2"
|
||||||
--dry-run)
|
shift 2
|
||||||
DRY_RUN=true; shift ;;
|
;;
|
||||||
--no-gate)
|
--service)
|
||||||
NO_GATE=true; shift ;;
|
TARGET_SERVICE="$2"
|
||||||
-h|--help)
|
shift 2
|
||||||
usage ;;
|
;;
|
||||||
|
--resume)
|
||||||
|
RESUME=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
--stage)
|
||||||
|
REQUESTED_STAGE="$2"
|
||||||
|
shift 2
|
||||||
|
;;
|
||||||
*)
|
*)
|
||||||
echo "Unknown argument: $1" >&2
|
if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
|
||||||
usage ;;
|
REQUESTED_STAGE="$1"
|
||||||
|
fi
|
||||||
|
shift
|
||||||
|
;;
|
||||||
esac
|
esac
|
||||||
done
|
done
|
||||||
|
|
||||||
[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
|
# --- Stages ---
|
||||||
|
|
||||||
case "$TARGET" in
|
stage_prepare() {
|
||||||
control-plane) SSH_HOST="vps" ;;
|
local host=$1
|
||||||
*) SSH_HOST="$TARGET" ;;
|
if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
|
||||||
esac
|
log "INFO" "Skipping PREPARE (already complete)"
|
||||||
|
return 0
|
||||||
case "$TARGET" in
|
|
||||||
chelsty-*) SSH_TIMEOUT=30 ;;
|
|
||||||
*) SSH_TIMEOUT=5 ;;
|
|
||||||
esac
|
|
||||||
|
|
||||||
# ── PREFLIGHT ────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
preflight() {
|
|
||||||
echo "=== PREFLIGHT ==="
|
|
||||||
|
|
||||||
local branch
|
|
||||||
branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
|
|
||||||
if [[ "$branch" != "master" ]]; then
|
|
||||||
echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
|
|
||||||
exit 1
|
|
||||||
fi
|
fi
|
||||||
echo "[ok] branch: master"
|
|
||||||
|
|
||||||
if ! git -C "$REPO_ROOT" diff --quiet; then
|
log "INFO" "Stage: PREPARE ($host)"
|
||||||
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
|
set_stage "prepare"
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
|
|
||||||
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
echo "[ok] working tree clean"
|
|
||||||
|
|
||||||
git -C "$REPO_ROOT" fetch origin master --quiet
|
emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
|
||||||
local unpushed
|
|
||||||
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
|
|
||||||
if [[ -n "$unpushed" ]]; then
|
|
||||||
echo "ERROR: Unpushed commits on master:" >&2
|
|
||||||
echo "$unpushed" >&2
|
|
||||||
echo "Push first: git push origin master" >&2
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
echo "[ok] no unpushed commits"
|
|
||||||
|
|
||||||
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
|
cd "$REPO_PATH" || exit 1
|
||||||
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
log "INFO" "Pulling latest changes..."
|
||||||
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
|
if ! git pull; then
|
||||||
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
|
log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
|
||||||
exit 1
|
|
||||||
fi
|
fi
|
||||||
echo "[ok] ${SSH_HOST} reachable"
|
|
||||||
|
# Ensure runtime directories exist
|
||||||
|
mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
|
||||||
|
|
||||||
|
struct_log "prepare" "$host" "all" "success" "repo_updated"
|
||||||
|
mark_stage_complete "prepare"
|
||||||
}
|
}
|
||||||
|
|
||||||
# ── GATE ─────────────────────────────────────────────────────────────────────
|
stage_validate() {
|
||||||
|
local host=$1
|
||||||
gate() {
|
if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
|
||||||
if [[ "$NO_GATE" == "true" ]]; then
|
log "INFO" "Skipping VALIDATE (already complete)"
|
||||||
echo "=== GATE: SKIPPED ==="
|
|
||||||
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
|
|
||||||
return 0
|
return 0
|
||||||
fi
|
fi
|
||||||
|
|
||||||
echo "=== GATE ==="
|
log "INFO" "Stage: VALIDATE ($host)"
|
||||||
|
set_stage "validate"
|
||||||
|
|
||||||
local services=()
|
for service in "${SERVICES[@]}"; do
|
||||||
|
log "INFO" "Validating $service..."
|
||||||
if [[ "$TARGET" == "control-plane" ]]; then
|
if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
|
||||||
services=("control-plane")
|
log "ERROR" "Service definition not found: $service"
|
||||||
else
|
struct_log "validate" "$host" "$service" "fail" "not_found"
|
||||||
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
|
return 1
|
||||||
if [[ ! -f "$svc_yaml" ]]; then
|
|
||||||
echo "ERROR: ${svc_yaml} not found." >&2
|
|
||||||
exit 2
|
|
||||||
fi
|
|
||||||
local svc_list
|
|
||||||
svc_list=$(python3 -c "
|
|
||||||
import yaml
|
|
||||||
with open('${svc_yaml}') as f:
|
|
||||||
data = yaml.safe_load(f)
|
|
||||||
svcs = data.get('services', {})
|
|
||||||
if isinstance(svcs, dict):
|
|
||||||
print('\n'.join(svcs.keys()))
|
|
||||||
elif isinstance(svcs, list):
|
|
||||||
print('\n'.join(svcs))
|
|
||||||
")
|
|
||||||
while IFS= read -r svc; do
|
|
||||||
[[ -z "$svc" ]] && continue
|
|
||||||
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
|
|
||||||
services+=("$svc")
|
|
||||||
fi
|
|
||||||
done <<< "$svc_list"
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ ${#services[@]} -eq 0 ]]; then
|
|
||||||
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo "Services under gate: ${services[*]}"
|
|
||||||
local gate_failed=false
|
|
||||||
|
|
||||||
for svc in "${services[@]}"; do
|
|
||||||
local svc_dir="${REPO_ROOT}/services/${svc}"
|
|
||||||
|
|
||||||
if [[ -d "${svc_dir}/tests" ]]; then
|
|
||||||
echo "--- pytest: ${svc} ---"
|
|
||||||
if ! python3 -m pytest "${svc_dir}/tests" -q; then
|
|
||||||
echo "GATE FAIL: pytest failed for ${svc}" >&2
|
|
||||||
gate_failed=true
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo "--- docker build: ${svc} ---"
|
|
||||||
if ! docker build --quiet "${svc_dir}" >/dev/null; then
|
|
||||||
echo "GATE FAIL: docker build failed for ${svc}" >&2
|
|
||||||
gate_failed=true
|
|
||||||
fi
|
fi
|
||||||
done
|
done
|
||||||
|
|
||||||
if [[ "$gate_failed" == "true" ]]; then
|
struct_log "validate" "$host" "all" "success" "validated"
|
||||||
exit 2
|
mark_stage_complete "validate"
|
||||||
fi
|
|
||||||
echo "[ok] gate passed"
|
|
||||||
}
|
}
|
||||||
|
|
||||||
# ── EXECUTE ──────────────────────────────────────────────────────────────────
|
stage_deploy() {
|
||||||
|
local host=$1
|
||||||
execute() {
|
if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
|
||||||
echo "=== EXECUTE ==="
|
log "INFO" "Skipping DEPLOY (already complete)"
|
||||||
|
return 0
|
||||||
local cmd_output
|
|
||||||
local cmd_exit=0
|
|
||||||
|
|
||||||
if [[ "$TARGET" == "control-plane" ]]; then
|
|
||||||
echo "Running deploy-control-plane.sh --ssh..."
|
|
||||||
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|
|
||||||
|| cmd_exit=$?
|
|
||||||
else
|
|
||||||
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
|
|
||||||
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
|
||||||
"${SSH_USER}@${SSH_HOST}" \
|
|
||||||
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|
|
||||||
|| cmd_exit=$?
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
echo "$cmd_output"
|
log "INFO" "Stage: DEPLOY ($host)"
|
||||||
|
set_stage "deploy"
|
||||||
|
|
||||||
if echo "$cmd_output" | grep -qF "[sudo] password"; then
|
local last_s=$(get_last_service)
|
||||||
echo "" >&2
|
local skip=false
|
||||||
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
|
if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
|
||||||
echo "Run manually:" >&2
|
skip=true
|
||||||
if [[ "$TARGET" == "control-plane" ]]; then
|
|
||||||
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
|
|
||||||
else
|
|
||||||
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
|
|
||||||
fi
|
|
||||||
exit 5
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
if [[ $cmd_exit -ne 0 ]]; then
|
for service in "${SERVICES[@]}"; do
|
||||||
echo "ERROR: Deploy command exited ${cmd_exit}." >&2
|
if [[ "$skip" == "true" ]]; then
|
||||||
exit 3
|
if [[ "$service" == "$last_s" ]]; then
|
||||||
fi
|
skip=false
|
||||||
|
log "INFO" "Resuming from $service..."
|
||||||
echo "[ok] execute completed"
|
else
|
||||||
}
|
log "INFO" "Skipping $service (already processed)"
|
||||||
|
continue
|
||||||
# ── VERIFY ───────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
verify() {
|
|
||||||
echo "=== VERIFY ==="
|
|
||||||
|
|
||||||
local ps_output
|
|
||||||
local ps_exit=0
|
|
||||||
ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
|
||||||
"${SSH_USER}@${SSH_HOST}" \
|
|
||||||
'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
|
|
||||||
|| ps_exit=$?
|
|
||||||
|
|
||||||
if [[ $ps_exit -ne 0 ]]; then
|
|
||||||
echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
|
|
||||||
echo "$ps_output" >&2
|
|
||||||
exit 4
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo "$ps_output"
|
|
||||||
|
|
||||||
local failed=false
|
|
||||||
|
|
||||||
local not_up
|
|
||||||
not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
|
|
||||||
if [[ -n "$not_up" ]]; then
|
|
||||||
echo "ERROR: Containers not in Up state:" >&2
|
|
||||||
echo "$not_up" >&2
|
|
||||||
failed=true
|
|
||||||
fi
|
|
||||||
|
|
||||||
local unhealthy
|
|
||||||
unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
|
|
||||||
if [[ -n "$unhealthy" ]]; then
|
|
||||||
echo "ERROR: Unhealthy containers:" >&2
|
|
||||||
echo "$unhealthy" >&2
|
|
||||||
failed=true
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ "$TARGET" == "control-plane" ]]; then
|
|
||||||
for cp_svc in supervisor observer executor operator-ui; do
|
|
||||||
if ! echo "$ps_output" | grep -q "$cp_svc"; then
|
|
||||||
echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
|
|
||||||
failed=true
|
|
||||||
fi
|
fi
|
||||||
done
|
fi
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ "$failed" == "true" ]]; then
|
log "INFO" "Deploying $service..."
|
||||||
echo "" >&2
|
set_last_service "$service"
|
||||||
echo "Full docker ps output above." >&2
|
|
||||||
exit 4
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo "[ok] all containers healthy"
|
if ! run_compose_up "$service"; then
|
||||||
|
struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
|
||||||
|
collect_diagnostics "$host" "$service"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
struct_log "deploy" "$host" "$service" "success" "deployed"
|
||||||
|
done
|
||||||
|
|
||||||
|
set_last_service ""
|
||||||
|
mark_stage_complete "deploy"
|
||||||
}
|
}
|
||||||
|
|
||||||
# ── REPORT ───────────────────────────────────────────────────────────────────
|
stage_verify() {
|
||||||
|
local host=$1
|
||||||
report() {
|
if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
|
||||||
local mode="${1:-deploy}"
|
log "INFO" "Skipping VERIFY (already complete)"
|
||||||
local end_time
|
return 0
|
||||||
end_time=$(date +%s)
|
|
||||||
local elapsed
|
|
||||||
elapsed=$(( end_time - START_TIME ))
|
|
||||||
local commit_hash
|
|
||||||
commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
|
|
||||||
local gate_s verify_s
|
|
||||||
|
|
||||||
if [[ "$NO_GATE" == "true" ]]; then
|
|
||||||
gate_s="skip"
|
|
||||||
else
|
|
||||||
gate_s="ok"
|
|
||||||
fi
|
fi
|
||||||
|
|
||||||
if [[ "$mode" == "dry-run" ]]; then
|
log "INFO" "Stage: VERIFY ($host)"
|
||||||
verify_s="skip(dry-run)"
|
set_stage "verify"
|
||||||
else
|
|
||||||
verify_s="green"
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo ""
|
for service in "${SERVICES[@]}"; do
|
||||||
if [[ "$mode" == "dry-run" ]]; then
|
log "INFO" "Verifying $service..."
|
||||||
echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
|
||||||
else
|
if [[ -f "$health_script" ]]; then
|
||||||
echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
if ! bash "$health_script"; then
|
||||||
fi
|
log "ERROR" "Healthcheck failed for $service"
|
||||||
|
struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
|
||||||
|
collect_diagnostics "$host" "$service"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
# Generic check if container is running
|
||||||
|
if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
|
||||||
|
log "ERROR" "Container $service is not running"
|
||||||
|
struct_log "verify" "$host" "$service" "fail" "container_not_running"
|
||||||
|
collect_diagnostics "$host" "$service"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
struct_log "verify" "$host" "$service" "success" "verified"
|
||||||
|
done
|
||||||
|
mark_stage_complete "verify"
|
||||||
}
|
}
|
||||||
|
|
||||||
# ── MAIN ─────────────────────────────────────────────────────────────────────
|
stage_complete() {
|
||||||
|
local host=$1
|
||||||
|
log "INFO" "Stage: COMPLETE ($host)"
|
||||||
|
set_stage "complete"
|
||||||
|
struct_log "complete" "$host" "all" "success" "deployment_finished"
|
||||||
|
clear_deployment_state
|
||||||
|
}
|
||||||
|
|
||||||
preflight
|
# --- Execution Logic ---
|
||||||
gate
|
|
||||||
|
|
||||||
if [[ "$DRY_RUN" == "true" ]]; then
|
run_deployment() {
|
||||||
report dry-run
|
local start_stage=$1
|
||||||
exit 0
|
|
||||||
|
# Sequential execution from start_stage
|
||||||
|
case "$start_stage" in
|
||||||
|
prepare)
|
||||||
|
stage_prepare "$TARGET_HOST" || return 1
|
||||||
|
;&
|
||||||
|
validate)
|
||||||
|
stage_validate "$TARGET_HOST" || return 1
|
||||||
|
;&
|
||||||
|
deploy)
|
||||||
|
stage_deploy "$TARGET_HOST" || return 1
|
||||||
|
;&
|
||||||
|
verify)
|
||||||
|
stage_verify "$TARGET_HOST" || return 1
|
||||||
|
;&
|
||||||
|
complete)
|
||||||
|
stage_complete "$TARGET_HOST" || return 1
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
log "ERROR" "Invalid stage: $start_stage"
|
||||||
|
return 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# --- Main ---
|
||||||
|
|
||||||
|
log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
|
||||||
|
|
||||||
|
if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
|
||||||
|
log "ERROR" "Failed to load inventory"
|
||||||
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
|
||||||
execute
|
EXIT_STATUS=0
|
||||||
verify
|
if [[ "$RESUME" == "true" ]]; then
|
||||||
report
|
CURRENT=$(get_stage)
|
||||||
|
log "INFO" "Resuming from state: $CURRENT"
|
||||||
|
case "$CURRENT" in
|
||||||
|
prepare|validate|deploy|verify)
|
||||||
|
run_deployment "$CURRENT" || EXIT_STATUS=1
|
||||||
|
;;
|
||||||
|
complete|none)
|
||||||
|
log "INFO" "No interrupted deployment found. Starting from scratch..."
|
||||||
|
run_deployment "prepare" || EXIT_STATUS=1
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
log "INFO" "Unknown state. Starting from prepare..."
|
||||||
|
run_deployment "prepare" || EXIT_STATUS=1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
elif [[ -n "$REQUESTED_STAGE" ]]; then
|
||||||
|
if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
|
||||||
|
collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
|
||||||
|
else
|
||||||
|
run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
# New deployment - clear previous state
|
||||||
|
clear_deployment_state
|
||||||
|
run_deployment "prepare" || EXIT_STATUS=1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $EXIT_STATUS -eq 0 ]]; then
|
||||||
|
print_summary "$TARGET_HOST" "SUCCESS"
|
||||||
|
log "INFO" "--- Homelab Deployment Finished Successfully ---"
|
||||||
|
else
|
||||||
|
print_summary "$TARGET_HOST" "FAILED"
|
||||||
|
log "ERROR" "--- Homelab Deployment Failed ---"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
|
||||||
|
|
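The deploy stage above appears to resume an interrupted run by skipping services until it reaches the one recorded via `set_last_service`, then deploying from that service onward. A minimal sketch of that loop follows; the service names and the `last_done` checkpoint variable are hypothetical stand-ins for whatever the real script loads from inventory and state.

```bash
# Hypothetical inputs; the real script reads these from inventory and saved state.
services="redis node-agent outline"
last_done="node-agent"

# Print the action taken for each service when resuming after last_done.
resume_plan() {
  local skip=false svc
  # Only skip ahead when a checkpoint was recorded.
  [ -n "$last_done" ] && skip=true
  for svc in $services; do
    if [ "$skip" = true ]; then
      if [ "$svc" = "$last_done" ]; then
        skip=false          # checkpoint reached: redo this service, then continue
      else
        echo "skip $svc"
        continue
      fi
    fi
    echo "deploy $svc"
  done
}

plan=$(resume_plan)
echo "$plan"
```

The checkpointed service is deployed again rather than skipped, because the checkpoint is written before the deploy command runs and so cannot prove that the service finished.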
@ -1,30 +1,15 @@
|
||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
# orchestrate-deploy.sh - To be run on SATURN
|
# orchestrate-deploy.sh - To be run on SATURN
|
||||||
# Triggers deployment on remote execution nodes via inventory.
|
# Triggers deployment on remote execution nodes.
|
||||||
|
|
||||||
set -e
|
set -e
|
||||||
|
|
||||||
REPO_PATH="${HOME}/homelab-codex-ws"
|
HOSTS=("solaria" "piha" "vps")
|
||||||
USER="oskar"
|
USER="oskar" # Default user
|
||||||
|
|
||||||
while IFS=' ' read -r HOST TAG; do
|
for HOST in "${HOSTS[@]}"; do
|
||||||
echo ">>> Triggering deployment on ${HOST}..."
|
echo ">>> Triggering deployment on ${HOST}..."
|
||||||
if [[ "$TAG" == "lte" ]]; then
|
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
||||||
ssh -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
|
done
|
||||||
echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
|
|
||||||
else
|
|
||||||
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
|
||||||
fi
|
|
||||||
done < <(python3 -c "
|
|
||||||
import yaml, sys
|
|
||||||
with open('${REPO_PATH}/inventory/topology.yaml') as f:
|
|
||||||
data = yaml.safe_load(f)
|
|
||||||
skip = {'saturn', 'solaria'}
|
|
||||||
for name, info in (data.get('nodes') or {}).items():
|
|
||||||
if name in skip:
|
|
||||||
continue
|
|
||||||
uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
|
|
||||||
print(name, 'lte' if uplink == 'lte' else 'standard')
|
|
||||||
")
|
|
||||||
|
|
||||||
echo ">>> All deployments triggered."
|
echo ">>> All deployments triggered."
|
||||||
|
|
|
||||||
|
|
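The inventory-driven side of `orchestrate-deploy.sh` treats nodes tagged `lte` differently: a failed or timed-out SSH call becomes a warning, while a failure on any other node propagates. A minimal sketch of that pattern follows, with a hypothetical `run_remote` stub in place of the real `ssh` invocation so it can run offline; the host name `edge-lte` is made up.

```bash
# Hypothetical stand-in for the ssh call; fails for one host to simulate a timeout.
run_remote() {
  [ "$1" != "edge-lte" ]
}

# Trigger one host. tag is "lte" or "standard".
trigger() {
  local host="$1" tag="$2"
  if [ "$tag" = "lte" ]; then
    # Failure is downgraded to a warning so the remaining hosts still run.
    run_remote "$host" || echo "WARNING: ${host} failed or timed out (skipping)"
  else
    # Failure propagates to the caller (under set -e the script would abort).
    run_remote "$host"
  fi
}

result=$(trigger edge-lte lte)
if trigger edge-lte standard; then status=ok; else status=failed; fi
echo "$result"
echo "standard status: $status"
```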
@ -1,68 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
|
|
||||||
|
|
||||||
REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
|
|
||||||
|
|
||||||
# Check if docker is available
|
|
||||||
if ! command -v docker &> /dev/null; then
|
|
||||||
echo "Error: docker command not found."
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Check if container is running
|
|
||||||
if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
|
|
||||||
echo "Error: agent-system-redis container not found or not running."
|
|
||||||
echo "This script must be run on PIHA (the node hosting the Redis container)."
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
|
|
||||||
MISSING_NODES=0
|
|
||||||
|
|
||||||
echo "--- Homelab Agent Fleet Status ---"
|
|
||||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
|
|
||||||
printf "%s\n" "--------------------------------------------------------------------------------"
|
|
||||||
|
|
||||||
for NODE in "${REQUIRED_NODES[@]}"; do
|
|
||||||
KEY="homelab:nodes:$NODE"
|
|
||||||
|
|
||||||
# Check if key exists
|
|
||||||
EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
|
|
||||||
|
|
||||||
if [[ "$EXISTS" != "1" ]]; then
|
|
||||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
|
|
||||||
MISSING_NODES=$((MISSING_NODES + 1))
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
|
|
||||||
HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
|
|
||||||
HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
|
|
||||||
STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
|
|
||||||
LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
|
|
||||||
|
|
||||||
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
|
|
||||||
done
|
|
||||||
|
|
||||||
echo ""
|
|
||||||
echo "--- Control Plane Summary ---"
|
|
||||||
if command -v jq >/dev/null; then
|
|
||||||
curl -s http://127.0.0.1:18180/summary | jq .
|
|
||||||
else
|
|
||||||
curl -s http://127.0.0.1:18180/summary
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo ""
|
|
||||||
echo "--- Control Plane Nodes ---"
|
|
||||||
if command -v jq >/dev/null; then
|
|
||||||
curl -s http://127.0.0.1:18180/nodes | jq .
|
|
||||||
else
|
|
||||||
curl -s http://127.0.0.1:18180/nodes
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ $MISSING_NODES -gt 0 ]]; then
|
|
||||||
echo ""
|
|
||||||
echo "Error: $MISSING_NODES required nodes are missing from Redis."
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
exit 0
|
|
||||||
|
|
@ -1,361 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
# Multi-agent worktree manager.
|
|
||||||
# EXIT: 0 ok, 1 preflight, 2 operation failed.
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
|
|
||||||
|
|
||||||
RESERVED_NAMES=(master main HEAD list merge clean new)
|
|
||||||
MAX_WORKTREES=4
|
|
||||||
|
|
||||||
die() { echo "ERROR: $1" >&2; exit "${2:-2}"; }
|
|
||||||
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
|
|
||||||
|
|
||||||
# ── helpers ──────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
is_main_checkout() {
|
|
||||||
local git_dir common_dir
|
|
||||||
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
|
|
||||||
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
|
|
||||||
[ "$git_dir" = "$common_dir" ]
|
|
||||||
}
|
|
||||||
|
|
||||||
require_main_checkout() {
|
|
||||||
is_main_checkout || prefail "must run from the main checkout, not a worktree"
|
|
||||||
}
|
|
||||||
|
|
||||||
require_master_branch() {
|
|
||||||
local branch
|
|
||||||
branch=$(git rev-parse --abbrev-ref HEAD)
|
|
||||||
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
|
|
||||||
}
|
|
||||||
|
|
||||||
require_clean_tree() {
|
|
||||||
local dirty
|
|
||||||
dirty=$(git status --porcelain)
|
|
||||||
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
|
|
||||||
}
|
|
||||||
|
|
||||||
worktree_paths() {
|
|
||||||
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
|
|
||||||
local main_path
|
|
||||||
main_path=$(git rev-parse --show-toplevel)
|
|
||||||
git worktree list --porcelain \
|
|
||||||
| awk '/^worktree /{p=$2} /^$/{print p}' \
|
|
||||||
| grep -v "^${main_path}$" \
|
|
||||||
|| true
|
|
||||||
}
|
|
||||||
|
|
||||||
worktree_count() {
|
|
||||||
worktree_paths | wc -l
|
|
||||||
}
|
|
||||||
|
|
||||||
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
|
|
||||||
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
|
|
||||||
|
|
||||||
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
|
|
||||||
|
|
||||||
age_str() {
|
|
||||||
local created_utc="$1"
|
|
||||||
local now_ts created_ts diff_s
|
|
||||||
now_ts=$(date -u +%s)
|
|
||||||
# replace T with a space for `date -d`; the trailing Z is kept and read as UTC
|
|
||||||
created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
|
|
||||||
diff_s=$(( now_ts - created_ts ))
|
|
||||||
if (( diff_s < 60 )); then echo "${diff_s}s"
|
|
||||||
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
|
|
||||||
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
|
|
||||||
else echo "$(( diff_s/86400 ))d"
|
|
||||||
fi
|
|
||||||
}
|
|
||||||
|
|
||||||
validate_name() {
|
|
||||||
local name="$1"
|
|
||||||
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
|
|
||||||
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
|
|
||||||
fi
|
|
||||||
for r in "${RESERVED_NAMES[@]}"; do
|
|
||||||
if [ "$name" = "$r" ]; then
|
|
||||||
prefail "'$name' is a reserved word"
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
}
|
|
||||||
|
|
||||||
# ── subcommands ───────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
cmd_new() {
|
|
||||||
local name="${1:-}"
|
|
||||||
[ -n "$name" ] || { usage; exit 1; }
|
|
||||||
|
|
||||||
validate_name "$name"
|
|
||||||
require_main_checkout
|
|
||||||
require_master_branch
|
|
||||||
require_clean_tree
|
|
||||||
|
|
||||||
# worktree limit
|
|
||||||
local count
|
|
||||||
count=$(worktree_count)
|
|
||||||
if (( count >= MAX_WORKTREES )); then
|
|
||||||
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
|
|
||||||
cmd_list
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
# branch collision
|
|
||||||
if branch_exists_local "task/$name"; then
|
|
||||||
prefail "branch task/$name already exists locally"
|
|
||||||
fi
|
|
||||||
git fetch origin master --quiet
|
|
||||||
if branch_exists_remote "refs/heads/task/$name"; then
|
|
||||||
prefail "branch task/$name already exists on origin"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# directory collision
|
|
||||||
local main_path wt_path
|
|
||||||
main_path=$(git rev-parse --show-toplevel)
|
|
||||||
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
|
||||||
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
|
|
||||||
|
|
||||||
# create worktree
|
|
||||||
git worktree add -b "task/$name" "$wt_path" origin/master \
|
|
||||||
|| die "git worktree add failed"
|
|
||||||
|
|
||||||
# write marker
|
|
||||||
local parent_commit
|
|
||||||
parent_commit=$(git rev-parse origin/master)
|
|
||||||
cat > "$wt_path/.agent-task" <<EOF
|
|
||||||
task: $name
|
|
||||||
branch: task/$name
|
|
||||||
parent_commit: $parent_commit
|
|
||||||
created_utc: $(utc_now)
|
|
||||||
worktree_path: $wt_path
|
|
||||||
EOF
|
|
||||||
|
|
||||||
echo ""
|
|
||||||
echo "Worktree created: $wt_path"
|
|
||||||
echo "Branch: task/$name"
|
|
||||||
echo ""
|
|
||||||
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
|
|
||||||
echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start work. Commit only to your own branch; do not push to origin master.\""
|
|
||||||
echo "─────────────────────────────────────────────────────────────────────────────"
|
|
||||||
}
|
|
||||||
|
|
||||||
cmd_list() {
|
|
||||||
local main_path
|
|
||||||
main_path=$(git rev-parse --show-toplevel)
|
|
||||||
|
|
||||||
# fetch to get up-to-date ahead/behind
|
|
||||||
git fetch origin master --quiet 2>/dev/null || true
|
|
||||||
|
|
||||||
local paths
|
|
||||||
paths=$(worktree_paths)
|
|
||||||
|
|
||||||
if [ -z "$paths" ]; then
|
|
||||||
echo "(no active task worktrees)"
|
|
||||||
return
|
|
||||||
fi
|
|
||||||
|
|
||||||
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
|
||||||
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
|
|
||||||
|
|
||||||
while IFS= read -r wt_path; do
|
|
||||||
[ -z "$wt_path" ] && continue
|
|
||||||
|
|
||||||
local marker="$wt_path/.agent-task"
|
|
||||||
local task_name branch parent_commit created_utc
|
|
||||||
if [ -f "$marker" ]; then
|
|
||||||
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
|
|
||||||
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
|
|
||||||
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
|
|
||||||
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
|
|
||||||
else
|
|
||||||
task_name="(no marker)"
|
|
||||||
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
|
|
||||||
parent_commit="?"
|
|
||||||
created_utc=""
|
|
||||||
fi
|
|
||||||
|
|
||||||
local status="clean"
|
|
||||||
local dirty
|
|
||||||
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
|
|
||||||
[ -n "$dirty" ] && status="dirty"
|
|
||||||
|
|
||||||
local ahead behind ab
|
|
||||||
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
|
|
||||||
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
|
|
||||||
ab="+${ahead}/-${behind}"
|
|
||||||
|
|
||||||
local age=""
|
|
||||||
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
|
|
||||||
|
|
||||||
local short_parent="${parent_commit:0:7}"
|
|
||||||
local short_created="${created_utc:0:10}"
|
|
||||||
|
|
||||||
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
|
||||||
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
|
|
||||||
done <<< "$paths"
|
|
||||||
}
|
|
||||||
|
|
||||||
cmd_merge() {
|
|
||||||
local name="${1:-}"
|
|
||||||
[ -n "$name" ] || { usage; exit 1; }
|
|
||||||
|
|
||||||
require_main_checkout
|
|
||||||
require_master_branch
|
|
||||||
require_clean_tree
|
|
||||||
|
|
||||||
git fetch origin --quiet
|
|
||||||
|
|
||||||
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
|
|
||||||
|
|
||||||
local main_path wt_path
|
|
||||||
main_path=$(git rev-parse --show-toplevel)
|
|
||||||
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
|
||||||
|
|
||||||
# attempt ff-only merge
|
|
||||||
local merge_failed=0
|
|
||||||
git merge --ff-only "task/$name" || merge_failed=1
|
|
||||||
|
|
||||||
if (( merge_failed )); then
|
|
||||||
# abort any partial merge state
|
|
||||||
git merge --abort 2>/dev/null || true
|
|
||||||
echo ""
|
|
||||||
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
|
|
||||||
echo " The branch has likely diverged from master." >&2
|
|
||||||
echo "" >&2
|
|
||||||
echo "Diagnose with:" >&2
|
|
||||||
echo " git log master..task/$name # commits only on task branch" >&2
|
|
||||||
echo " git log task/$name..master # commits master has that task doesn't" >&2
|
|
||||||
echo "" >&2
|
|
||||||
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
|
|
||||||
echo "Worktree and branch are preserved — no changes made." >&2
|
|
||||||
exit 2
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo "Merged task/$name into master (fast-forward)."
|
|
||||||
|
|
||||||
git push origin master || die "git push origin master failed"
|
|
||||||
echo "Pushed master to origin."
|
|
||||||
|
|
||||||
if [ -d "$wt_path" ]; then
|
|
||||||
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
|
|
||||||
echo "Removed worktree: $wt_path"
|
|
||||||
else
|
|
||||||
echo "(worktree directory $wt_path not found — skipping worktree remove)"
|
|
||||||
fi
|
|
||||||
|
|
||||||
git branch -d "task/$name" || die "git branch -d task/$name failed"
|
|
||||||
echo "Deleted local branch task/$name."
|
|
||||||
|
|
||||||
git push origin --delete "task/$name" 2>/dev/null \
|
|
||||||
&& echo "Deleted remote branch task/$name." \
|
|
||||||
|| echo "(remote branch task/$name not found — nothing to delete)"
|
|
||||||
|
|
||||||
echo ""
|
|
||||||
echo "Done. task/$name merged and cleaned up."
|
|
||||||
}
|
|
||||||
|
|
||||||
cmd_clean() {
|
|
||||||
local main_path
|
|
||||||
main_path=$(git rev-parse --show-toplevel)
|
|
||||||
git fetch origin --quiet 2>/dev/null || true
|
|
||||||
|
|
||||||
local to_remove=()
|
|
||||||
|
|
||||||
# orphaned registered worktrees: branch deleted or fully merged into master
|
|
||||||
local paths
|
|
||||||
paths=$(worktree_paths)
|
|
||||||
while IFS= read -r wt_path; do
|
|
||||||
[ -z "$wt_path" ] && continue
|
|
||||||
local branch
|
|
||||||
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
|
|
||||||
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
|
|
||||||
|
|
||||||
# branch gone locally?
|
|
||||||
if ! branch_exists_local "$branch"; then
|
|
||||||
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
|
|
||||||
# branch fully merged into master?
|
|
||||||
local ahead
|
|
||||||
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
|
|
||||||
if [ "$ahead" = "0" ]; then
|
|
||||||
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
|
|
||||||
fi
|
|
||||||
done <<< "$paths"
|
|
||||||
|
|
||||||
# dangling directories: ../homelab-codex-ws-* not registered
|
|
||||||
local registered_paths
|
|
||||||
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
|
|
||||||
local parent_dir
|
|
||||||
parent_dir=$(dirname "$main_path")
|
|
||||||
while IFS= read -r candidate; do
|
|
||||||
[ -d "$candidate" ] || continue
|
|
||||||
if ! echo "$registered_paths" | grep -qxF "$candidate"; then
|
|
||||||
to_remove+=("dangling:$candidate")
|
|
||||||
fi
|
|
||||||
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
|
|
||||||
|
|
||||||
if [ ${#to_remove[@]} -eq 0 ]; then
|
|
||||||
echo "Nothing to clean."
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
echo "Found ${#to_remove[@]} item(s) to clean:"
|
|
||||||
for entry in "${to_remove[@]}"; do
|
|
||||||
echo " $entry"
|
|
||||||
done
|
|
||||||
echo ""
|
|
||||||
|
|
||||||
local overall_rc=0
|
|
||||||
for entry in "${to_remove[@]}"; do
|
|
||||||
local kind="${entry%%:*}"
|
|
||||||
local path="${entry#*:}"
|
|
||||||
# strip trailing annotation in parens
|
|
||||||
local raw_path
|
|
||||||
raw_path="${path%% (*}"
|
|
||||||
|
|
||||||
local confirm
|
|
||||||
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
|
|
||||||
if [[ "$confirm" =~ ^[Yy]$ ]]; then
|
|
||||||
if [ "$kind" = "worktree" ]; then
|
|
||||||
git worktree remove --force "$raw_path" 2>/dev/null \
|
|
||||||
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
|
|
||||||
else
|
|
||||||
rm -rf "$raw_path"
|
|
||||||
fi
|
|
||||||
echo " Removed."
|
|
||||||
else
|
|
||||||
echo " Skipped."
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
|
|
||||||
return $overall_rc
|
|
||||||
}
|
|
||||||
|
|
||||||
usage() {
|
|
||||||
cat <<'EOF'
|
|
||||||
Usage: agent.sh <subcommand> [args]
|
|
||||||
|
|
||||||
agent.sh new <name> Create a new task worktree (branch task/<name>)
|
|
||||||
agent.sh list List active task worktrees with status
|
|
||||||
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
|
|
||||||
agent.sh clean Remove orphaned or dangling worktrees (interactive)
|
|
||||||
|
|
||||||
EXIT: 0 ok, 1 preflight, 2 operation failed.
|
|
||||||
EOF
|
|
||||||
}
|
|
||||||
|
|
||||||
# ── dispatch ──────────────────────────────────────────────────────────────────
|
|
||||||
|
|
||||||
SUBCOMMAND="${1:-}"
|
|
||||||
shift || true
|
|
||||||
|
|
||||||
case "$SUBCOMMAND" in
|
|
||||||
new) cmd_new "$@" ;;
|
|
||||||
list) cmd_list "$@" ;;
|
|
||||||
merge) cmd_merge "$@" ;;
|
|
||||||
clean) cmd_clean "$@" ;;
|
|
||||||
*) usage; exit 1 ;;
|
|
||||||
esac
|
|
||||||
|
|
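`cmd_clean` in `agent.sh` flags a `homelab-codex-ws-*` directory as dangling when its path is not among the registered worktree paths. A minimal sketch of that membership test follows, using whole-line fixed-string matching (`grep -qxF`) so that a path that is merely a prefix of another does not count as registered. The paths are made up for illustration.

```bash
# Hypothetical registered worktree paths, one per line.
registered='/home/u/homelab-codex-ws
/home/u/homelab-codex-ws-foobar'

# Succeeds only if $1 equals one whole line of $registered.
is_registered() {
  printf '%s\n' "$registered" | grep -qxF "$1"
}

classify() {
  local candidate
  for candidate in "$@"; do
    if is_registered "$candidate"; then
      echo "registered: $candidate"
    else
      echo "dangling: $candidate"
    fi
  done
}

report=$(classify /home/u/homelab-codex-ws-foo /home/u/homelab-codex-ws-foobar)
echo "$report"
```

With plain `grep -qF`, the first candidate would match as a substring of `homelab-codex-ws-foobar` and would not be reported as dangling.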
@ -1,338 +0,0 @@
|
||||||
#!/usr/bin/env bash
|
|
||||||
# health-monitor.sh - Homelab node health monitor and safe disk cleanup
|
|
||||||
#
|
|
||||||
# Designed to run standalone on the host (cron or direct) or to be called by
|
|
||||||
# the node-agent Python daemon. All cleanup decisions follow the conservative
|
|
||||||
# policy agreed in the design review:
|
|
||||||
#
|
|
||||||
# lte_node (chelsty-infra, chelsty-ha) : NO cleanup at all
|
|
||||||
# sd_card (piha, saturn) : dangling images + stopped containers,
|
|
||||||
# rate-limited to once per 24 h
|
|
||||||
# ai_node (solaria) : dangling images + stopped containers
|
|
||||||
# + build cache (NEVER -a)
|
|
||||||
# standard (vps) : dangling images + stopped containers
|
|
||||||
# + build cache
|
|
||||||
#
|
|
||||||
# VPS additionally rotates control-plane filesystem artefacts:
|
|
||||||
# actions/completed + failed > 7 days
|
|
||||||
# logs/deploy > 30 days
|
|
||||||
# events/** > 3 days AND past observer checkpoint
|
|
||||||
#
|
|
||||||
# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
|
|
||||||
# actions/pending|approved|running, Frigate recordings, Ollama models,
|
|
||||||
# Zigbee2MQTT data, Mosquitto data, HA database/config.
|
|
||||||
|
|
||||||
set -euo pipefail
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Configuration
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
|
|
||||||
EVENTS_DIR="${RUNTIME_PATH}/events"
|
|
||||||
STATE_DIR="${RUNTIME_PATH}/state"
|
|
||||||
LOGS_DIR="${RUNTIME_PATH}/logs"
|
|
||||||
ACTIONS_DIR="${RUNTIME_PATH}/actions"
|
|
||||||
|
|
||||||
NODE_NAME="${NODE_NAME:-$(hostname)}"
|
|
||||||
TIMESTAMP=$(date +%s)
|
|
||||||
DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
|
|
||||||
|
|
||||||
# Thresholds
|
|
||||||
DISK_WARN_PCT=75
|
|
||||||
DISK_CRIT_PCT=85
|
|
||||||
MEM_WARN_PCT=85
|
|
||||||
MEM_CRIT_PCT=95
|
|
||||||
|
|
||||||
# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
|
|
||||||
CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
|
|
||||||
CLEANUP_INTERVAL=86400 # seconds
|
|
||||||
|
|
||||||
# Node classifications
|
|
||||||
LTE_NODES="chelsty-infra chelsty-ha"
|
|
||||||
SD_CARD_NODES="piha saturn"
|
|
||||||
AI_NODES="solaria"
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Helpers
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
log() { echo "$(date -u +%H:%M:%S) [INFO] $*"; }
|
|
||||||
warn() { echo "$(date -u +%H:%M:%S) [WARN] $*" >&2; }
|
|
||||||
err() { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
|
|
||||||
|
|
||||||
contains() {
|
|
||||||
local word="$1"; shift
|
|
||||||
for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
|
|
||||||
return 1
|
|
||||||
}
|
|
||||||
|
|
||||||
get_node_type() {
|
|
||||||
# shellcheck disable=SC2086
|
|
||||||
if contains "$NODE_NAME" $LTE_NODES; then echo "lte_node"; return; fi
|
|
||||||
if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card"; return; fi
|
|
||||||
if contains "$NODE_NAME" $AI_NODES; then echo "ai_node"; return; fi
|
|
||||||
echo "standard"
|
|
||||||
}
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Event emission
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
emit_event() {
|
|
||||||
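local type="$1" severity="$2" service="${3:-}" message="$4" payload="${5:-}"
# Default applied separately: "${5:-{}}" ends the expansion at the first "}" and appends a stray brace.
[[ -n "$payload" ]] || payload='{}'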
|
|
||||||
local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}"
|
|
||||||
local dir="${EVENTS_DIR}/${NODE_NAME}"
|
|
||||||
mkdir -p "$dir"
|
|
||||||
cat > "${dir}/${id}.json" <<EOF
|
|
||||||
{
|
|
||||||
"id": "${id}",
|
|
||||||
"timestamp": ${TIMESTAMP},
|
|
||||||
"date": "${DATE}",
|
|
||||||
"type": "${type}",
|
|
||||||
"severity": "${severity}",
|
|
||||||
"node": "${NODE_NAME}",
|
|
||||||
"service": "${service}",
|
|
||||||
"message": "${message}",
|
|
||||||
"payload": ${payload}
|
|
||||||
}
|
|
||||||
EOF
|
|
||||||
}
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Health checks
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
check_disk() {
|
|
||||||
# Use /opt/homelab as the check target — it lives on the host filesystem
|
|
||||||
# and this path is correct both when running natively and in a container
|
|
||||||
# that mounts /opt/homelab from the host.
|
|
||||||
local mount="${RUNTIME_PATH}"
|
|
||||||
local usage_pct avail_mb total_mb
|
|
||||||
usage_pct=$(df "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}') || return
|
|
||||||
avail_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $4/1024}') || return
|
|
||||||
total_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $2/1024}') || return
|
|
||||||
|
|
||||||
if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
|
|
||||||
warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
|
|
||||||
emit_event "disk_pressure" "high" "" \
|
|
||||||
"Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
|
|
||||||
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
|
|
||||||
elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
|
|
||||||
warn "Disk elevated: ${usage_pct}% used"
|
|
||||||
emit_event "disk_pressure" "medium" "" \
|
|
||||||
"Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
|
|
||||||
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
|
|
||||||
fi
|
|
||||||
echo "${usage_pct}"
|
|
||||||
}
|
|
||||||
|
|
||||||
check_memory() {
|
|
||||||
local total avail pct avail_mb
|
|
||||||
total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
|
|
||||||
avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
|
|
||||||
pct=$(( (total - avail) * 100 / total ))
|
|
||||||
avail_mb=$(( avail / 1024 ))
|
|
||||||
|
|
||||||
if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
|
|
||||||
warn "Memory CRITICAL: ${pct}% used"
|
|
||||||
emit_event "high_memory" "high" "" \
|
|
||||||
"Memory usage critical: ${pct}% (${avail_mb} MB available)" \
|
|
||||||
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
|
|
||||||
elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
|
|
||||||
warn "Memory elevated: ${pct}%"
|
|
||||||
emit_event "high_memory" "medium" "" \
|
|
||||||
"Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
|
|
||||||
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
|
|
||||||
fi
|
|
||||||
echo "${pct}"
|
|
||||||
}
|
|
||||||
|
|
||||||
check_cpu() {
    # Two-sample /proc/stat delta: CPU usage averaged over a 1-second window.
    # The counters are cumulative since boot, so a single read cannot give a
    # current figure. Only the idle column ($5) counts as idle; iowait is
    # treated as busy time.
    local idle1 total1 idle2 total2 pct
    read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
    sleep 1
    read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)

    local d_idle=$(( idle2 - idle1 ))
    local d_total=$(( total2 - total1 ))
    pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))

    if [[ "${pct}" -ge 90 ]]; then
        warn "CPU elevated: ${pct}%"
        emit_event "high_cpu" "medium" "" \
            "CPU usage elevated: ${pct}%" \
            "{\"usage_pct\": ${pct}}"
    fi
    echo "${pct}"
}
|
|
||||||
|
|
||||||
check_containers() {
    # A node without Docker is not an error; there is simply nothing to check.
    command -v docker &>/dev/null || return 0

    # Exited containers that belong to a Compose project. The Compose label is
    # used as a proxy for "should be running"; the restart policy itself is
    # not inspected here.
    local cname
    while IFS= read -r cname; do
        [[ -z "$cname" ]] && continue
        warn "Container exited (should be running): ${cname}"
        emit_event "containers_not_running" "high" "${cname}" \
            "Container '${cname}' has exited unexpectedly (restart=unless-stopped)" \
            "{\"container\": \"${cname}\"}"
    done < <(docker ps -a \
        --filter "status=exited" \
        --filter "label=com.docker.compose.project" \
        --format "{{.Names}}" 2>/dev/null || true)

    # Containers that are running but whose health check is failing
    while IFS= read -r cname; do
        [[ -z "$cname" ]] && continue
        warn "Container unhealthy: ${cname}"
        emit_event "healthcheck_failed" "high" "${cname}" \
            "Container '${cname}' is running but health check is failing" \
            "{\"container\": \"${cname}\"}"
    done < <(docker ps \
        --filter "health=unhealthy" \
        --format "{{.Names}}" 2>/dev/null || true)
}
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Safe Docker cleanup (per policy)
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
_sd_card_rate_ok() {
|
|
||||||
if [[ -f "${CLEANUP_LOCK}" ]]; then
|
|
||||||
local last_ts elapsed
|
|
||||||
last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
|
|
||||||
elapsed=$(( TIMESTAMP - last_ts ))
|
|
||||||
if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
|
|
||||||
log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
|
|
||||||
return 1
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
return 0
|
|
||||||
}
|
|
||||||
|
|
||||||
_mark_cleanup_done() {
|
|
||||||
echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
|
|
||||||
}
|
|
||||||
|
|
||||||
run_safe_cleanup() {
|
|
||||||
command -v docker &>/dev/null || return
|
|
||||||
local node_type
|
|
||||||
node_type=$(get_node_type)
|
|
||||||
|
|
||||||
case "${node_type}" in
|
|
||||||
lte_node)
|
|
||||||
# NO cleanup on LTE nodes. Any docker operation risks triggering
|
|
||||||
# a pull over a metered/intermittent connection.
|
|
||||||
log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
|
|
||||||
;;
|
|
||||||
|
|
||||||
sd_card)
|
|
||||||
# Dangling images + stopped containers only.
|
|
||||||
# Rate-limited to once per 24 hours to protect SD card write endurance.
|
|
||||||
_sd_card_rate_ok || return
|
|
||||||
log "Running rate-limited Docker cleanup (SD card node)"
|
|
||||||
docker image prune -f >/dev/null 2>&1 || true
|
|
||||||
docker container prune -f >/dev/null 2>&1 || true
|
|
||||||
_mark_cleanup_done
|
|
||||||
;;
|
|
||||||
|
|
||||||
ai_node)
|
|
||||||
# Dangling images + stopped containers + build cache.
|
|
||||||
# NEVER docker image prune -a (would remove Ollama runtime images,
|
|
||||||
# requiring a multi-hour re-pull of model weights).
|
|
||||||
log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
|
|
||||||
docker image prune -f >/dev/null 2>&1 || true
|
|
||||||
docker container prune -f >/dev/null 2>&1 || true
|
|
||||||
docker builder prune -f >/dev/null 2>&1 || true
|
|
||||||
;;
|
|
||||||
|
|
||||||
standard)
|
|
||||||
# VPS and other standard nodes: full safe cleanup.
|
|
||||||
log "Running standard Docker cleanup"
|
|
||||||
docker image prune -f >/dev/null 2>&1 || true
|
|
||||||
docker container prune -f >/dev/null 2>&1 || true
|
|
||||||
docker builder prune -f >/dev/null 2>&1 || true
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
}
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# VPS-specific: control-plane filesystem rotation
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
cleanup_control_plane_fs() {
    log "Running control-plane filesystem rotation"

    # Completed / failed actions older than 7 days
    local status dir
    for status in completed failed; do
        dir="${ACTIONS_DIR}/${status}"
        [[ -d "${dir}" ]] || continue
        find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
            log "Cleaned ${status} actions older than 7 days" || true
    done

    # Deploy logs older than 30 days
    local deploy_logs="${LOGS_DIR}/deploy"
    if [[ -d "${deploy_logs}" ]]; then
        find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
            log "Cleaned deploy logs older than 30 days" || true
    fi

    # Event files older than 3 days AND already past the observer checkpoint.
    # The dual condition ensures we never delete an event the observer hasn't seen.
    # The observer writes {"node_checkpoints": {"<node_dir>": "<last file>"}};
    # the legacy single "last_processed_file" key is still accepted, since
    # reading only that key would find nothing in a new-format checkpoint.
    # Requires bash 4+ for the associative array.
    local checkpoint="${STATE_DIR}/observer_checkpoint.json"
    if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
        local -A node_ckpt=()
        local node_dir last f rel have_ckpt=0
        while IFS=$'\t' read -r node_dir last; do
            if [[ -n "${node_dir}" && -n "${last}" ]]; then
                node_ckpt["${node_dir}"]="${last}"
                have_ckpt=1
            fi
        done < <(python3 -c '
import json, os, sys
try:
    d = json.load(open(sys.argv[1]))
    cps = dict(d.get("node_checkpoints") or {})
    old = d.get("last_processed_file")
    if not cps and old:
        cps = {os.path.relpath(old, sys.argv[2]).split(os.sep)[0]: old}
    for node_dir, last in cps.items():
        print(node_dir + "\t" + last)
except Exception:
    pass
' "${checkpoint}" "${EVENTS_DIR}" 2>/dev/null || true)

        if [[ "${have_ckpt}" -eq 1 ]]; then
            while IFS= read -r f; do
                rel="${f#"${EVENTS_DIR}/"}"
                node_dir="${rel%%/*}"
                [[ -n "${node_dir}" ]] || continue
                last="${node_ckpt[${node_dir}]:-}"
                # Only delete files that sort before their own node's checkpoint
                # (i.e., the observer has already processed them).
                if [[ -n "${last}" && "$f" < "${last}" ]]; then
                    rm -f "$f"
                    log "Cleaned old event: $(basename "$f")"
                fi
            done < <(find "${EVENTS_DIR}" -name "*.json" -mtime +3 2>/dev/null)
        else
            log "No observer checkpoint set; skipping event file cleanup"
        fi
    fi
}
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
# Main
|
|
||||||
# ---------------------------------------------------------------------------
|
|
||||||
|
|
||||||
mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
|
|
||||||
|
|
||||||
log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
|
|
||||||
|
|
||||||
disk_pct=$(check_disk || echo 0)
|
|
||||||
mem_pct=$(check_memory || echo 0)
|
|
||||||
cpu_pct=$(check_cpu || echo 0)
|
|
||||||
check_containers
|
|
||||||
|
|
||||||
run_safe_cleanup
|
|
||||||
|
|
||||||
# VPS: also rotate control-plane filesystem artefacts
|
|
||||||
if [[ "${NODE_NAME}" == "vps" ]]; then
|
|
||||||
cleanup_control_plane_fs
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Emit a node_health heartbeat so the observer can update node status
|
|
||||||
# and the supervisor can see up-to-date resource metrics.
|
|
||||||
emit_event "node_health" "info" "" \
|
|
||||||
"Health check completed on ${NODE_NAME}" \
|
|
||||||
"{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
|
|
||||||
|
|
||||||
log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
|
|
||||||
|
|
@ -1,520 +0,0 @@
|
||||||
import os
import json
import time
import glob
import logging
import yaml
from datetime import datetime, timezone
from pathlib import Path


def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def _parse_ts(ts) -> float:
    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.

    Events from node-agent use int(time.time()); events from stability-agent / events.py
    use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
    last_occurrence and resolved_at, so any arithmetic on them must go through here.
    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
    """
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except Exception:
        return 0.0
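
# Usage sketch for the timestamp helper above (illustrative only, not part of
# the module). It assumes a hypothetical stand-in, parse_ts_sketch, that
# restates _parse_ts so the block runs on its own, and a sample epoch value
# chosen to match the ISO example in the docstring. The point is that both
# event formats should map to the same float.

```python
from datetime import datetime, timezone


def parse_ts_sketch(ts) -> float:
    """Hypothetical restatement of _parse_ts, kept local so the sketch is self-contained."""
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        # Python < 3.11 does not accept a trailing "Z", hence the replace.
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except Exception:
        return 0.0


# The same instant in both formats (sample value, assumed for illustration).
epoch_form = 1780482600
iso_form = datetime.fromtimestamp(epoch_form, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

same_instant = parse_ts_sketch(epoch_form) == parse_ts_sketch(iso_form)
fallback = parse_ts_sketch("not-a-timestamp")
print(iso_form, same_instant, fallback)
```

# Because unparseable input collapses to 0.0, an age computed as
# now - parse_ts_sketch(value) becomes very large for bad data, so such
# records fall on the "old" side of any age comparison.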
|
|
||||||
|
|
||||||
# Constants and Paths
|
|
||||||
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
|
|
||||||
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
|
|
||||||
STATE_DIR = Path(RUNTIME_PATH) / "state"
|
|
||||||
LOGS_DIR = Path(RUNTIME_PATH) / "logs"
|
|
||||||
WORLD_DIR = Path(RUNTIME_PATH) / "world"
|
|
||||||
OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"
|
|
||||||
|
|
||||||
REPO_ROOT = Path(__file__).parent.parent.parent
|
|
||||||
INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"
|
|
||||||
|
|
||||||
# Logging setup
|
|
||||||
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
|
|
||||||
logger = logging.getLogger("observer")
|
|
||||||
|
|
||||||
class Observer:
|
|
||||||
def __init__(self):
|
|
||||||
# Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
|
|
||||||
# Replaces the old single last_processed_file which silently skipped event dirs
|
|
||||||
# that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
|
|
||||||
self.node_checkpoints: dict = {}
|
|
||||||
self.world_state = {
|
|
||||||
"nodes": {},
|
|
||||||
"services": {},
|
|
||||||
"deployments": {},
|
|
||||||
"incidents": {},
|
|
||||||
"summary": {
|
|
||||||
"last_update": datetime.now(timezone.utc).isoformat(),
|
|
||||||
"status": "initializing",
|
|
||||||
"active_incidents_count": 0
|
|
||||||
}
|
|
||||||
}
|
|
||||||
self.inventory = self._load_inventory()
|
|
||||||
self._ensure_dirs()
|
|
||||||
self._load_checkpoint()
|
|
||||||
|
|
||||||
def _ensure_dirs(self):
|
|
||||||
WORLD_DIR.mkdir(parents=True, exist_ok=True)
|
|
||||||
STATE_DIR.mkdir(parents=True, exist_ok=True)
|
|
||||||
EVENTS_DIR.mkdir(parents=True, exist_ok=True)
|
|
||||||
LOGS_DIR.mkdir(parents=True, exist_ok=True)
|
|
||||||
|
|
||||||
    def _load_inventory(self):
        inventory = {"nodes": {}, "services": {}}
        try:
            if INVENTORY_TOPOLOGY.exists():
                with open(INVENTORY_TOPOLOGY, "r") as f:
                    # safe_load returns None for an empty file
                    topo = yaml.safe_load(f) or {}
                for node_name, node_info in topo.get("nodes", {}).items():
                    inventory["nodes"][node_name] = {
                        "roles": node_info.get("roles", []),
                        "connectivity": node_info.get("connectivity", {})
                    }

            # Load service assignments from hosts files
            hosts_dir = REPO_ROOT / "hosts"
            for host_dir in hosts_dir.iterdir():
                if host_dir.is_dir():
                    svc_file = host_dir / "services.yaml"
                    if svc_file.exists():
                        with open(svc_file, "r") as f:
                            svc_data = yaml.safe_load(f) or {}
                        host_name = svc_data.get("host")
                        for svc_name, svc_info in svc_data.get("services", {}).items():
                            if host_name not in inventory["services"]:
                                inventory["services"][host_name] = {}
                            inventory["services"][host_name][svc_name] = {
                                "role": svc_info.get("role"),
                                "exposure": svc_info.get("exposure")
                            }
        except Exception as e:
            logger.error(f"Failed to load inventory: {e}")
        return inventory
|
|
||||||
|
|
||||||
def _load_checkpoint(self):
|
|
||||||
if OBSERVER_STATE_FILE.exists():
|
|
||||||
try:
|
|
||||||
with open(OBSERVER_STATE_FILE, "r") as f:
|
|
||||||
checkpoint = json.load(f)
|
|
||||||
|
|
||||||
if "node_checkpoints" in checkpoint:
|
|
||||||
# New format: per-directory checkpoints.
|
|
||||||
self.node_checkpoints = checkpoint["node_checkpoints"]
|
|
||||||
elif "last_processed_file" in checkpoint:
|
|
||||||
# Migrate old single-file checkpoint: extract node dir from path.
|
|
||||||
old = checkpoint["last_processed_file"]
|
|
||||||
if old:
|
|
||||||
try:
|
|
||||||
node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
|
|
||||||
self.node_checkpoints = {node_dir: old}
|
|
||||||
logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
|
|
||||||
except Exception:
|
|
||||||
pass # Bad path — start fresh
|
|
||||||
|
|
||||||
self._load_world_from_disk()
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to load checkpoint: {e}")
|
|
||||||
|
|
||||||
def _load_world_from_disk(self):
|
|
||||||
# Optional: Load existing state to resume faster
|
|
||||||
files = {
|
|
||||||
"nodes": WORLD_DIR / "nodes.json",
|
|
||||||
"services": WORLD_DIR / "services.json",
|
|
||||||
"deployments": WORLD_DIR / "deployments.json",
|
|
||||||
"incidents": WORLD_DIR / "incidents.json",
|
|
||||||
"summary": WORLD_DIR / "runtime-summary.json"
|
|
||||||
}
|
|
||||||
for key, path in files.items():
|
|
||||||
if path.exists():
|
|
||||||
try:
|
|
||||||
with open(path, "r") as f:
|
|
||||||
self.world_state[key] = json.load(f)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to load {key} state: {e}")
|
|
||||||
|
|
||||||
def _save_checkpoint(self):
|
|
||||||
try:
|
|
||||||
_atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to save checkpoint: {e}")
|
|
||||||
|
|
||||||
def _prune_stale_world(self):
|
|
||||||
"""Remove world-state entries for nodes absent from the topology inventory.
|
|
||||||
|
|
||||||
Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
|
|
||||||
falls back to socket.gethostname(), which inside a Docker container returns the
|
|
||||||
12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
|
|
||||||
('vps'). The observer ingests those events and creates ghost entries that never
|
|
||||||
expire on their own.
|
|
||||||
|
|
||||||
Also ages out resolved incidents older than 7 days to keep world state lean.
|
|
||||||
"""
|
|
||||||
known_nodes = set(self.inventory["nodes"].keys())
|
|
||||||
if not known_nodes:
|
|
||||||
# Inventory failed to load — don't prune to avoid wiping valid state.
|
|
||||||
return
|
|
||||||
|
|
||||||
stale_nodes = [n for n in list(self.world_state["nodes"].keys())
|
|
||||||
if n not in known_nodes]
|
|
||||||
for n in stale_nodes:
|
|
||||||
logger.info(f"Pruning stale node from world state: {n}")
|
|
||||||
del self.world_state["nodes"][n]
|
|
||||||
|
|
||||||
stale_svcs = [k for k in list(self.world_state["services"].keys())
|
|
||||||
if k.split("/")[0] in stale_nodes]
|
|
||||||
for k in stale_svcs:
|
|
||||||
logger.info(f"Pruning stale service from world state: {k}")
|
|
||||||
del self.world_state["services"][k]
|
|
||||||
|
|
||||||
# Prune ghost service keys whose service-name portion is a hash-prefixed
|
|
||||||
# Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
|
|
||||||
# These are created when node-agent incorrectly uses c.name instead of the
|
|
||||||
# compose label, and accumulate on every container rebuild.
|
|
||||||
# Pattern: <node>/<12hexchars>_<real-name>
|
|
||||||
ghost_svcs = [
|
|
||||||
k for k in list(self.world_state["services"].keys())
|
|
||||||
if len(k.split("/", 1)) == 2
|
|
||||||
and len(k.split("/", 1)[1]) > 13
|
|
||||||
and k.split("/", 1)[1][12] == "_"
|
|
||||||
and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
|
|
||||||
]
|
|
||||||
for k in ghost_svcs:
|
|
||||||
logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
|
|
||||||
del self.world_state["services"][k]
|
|
||||||
|
|
||||||
now = time.time()
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Collect incident_ids currently referenced by any service entry.
|
|
||||||
linked_ids: set = {
|
|
||||||
svc.get("incident_id")
|
|
||||||
for svc in self.world_state["services"].values()
|
|
||||||
if svc.get("incident_id")
|
|
||||||
}
|
|
||||||
|
|
||||||
# Case 1 — service is healthy but still points at an active incident.
|
|
||||||
# process_event already calls _resolve_incident on service_healthy events,
|
|
||||||
# but if the observer restarted with on-disk state where the link was
|
|
||||||
# intact (inconsistency from a pre-atomic-write crash), it may not get
|
|
||||||
# resolved until the next service_healthy event is processed. Resolve
|
|
||||||
# immediately — a healthy service cannot have an ongoing incident.
|
|
||||||
for svc_key, svc in self.world_state["services"].items():
|
|
||||||
if svc.get("status") != "healthy":
|
|
||||||
continue
|
|
||||||
inc_id = svc.get("incident_id")
|
|
||||||
if not inc_id:
|
|
||||||
continue
|
|
||||||
inc = self.world_state["incidents"].get(inc_id, {})
|
|
||||||
if inc.get("status") == "active":
|
|
||||||
logger.info(
|
|
||||||
f"Auto-resolving incident {inc_id} for {svc_key}: "
|
|
||||||
f"service is healthy"
|
|
||||||
)
|
|
||||||
inc["status"] = "resolved"
|
|
||||||
inc["resolved_at"] = now
|
|
||||||
svc["incident_id"] = None
|
|
||||||
linked_ids.discard(inc_id)
|
|
||||||
|
|
||||||
# Case 2 — orphaned active incident: no service entry links to it and
|
|
||||||
# last_occurrence is older than 5 minutes (guard against creation races).
|
|
||||||
# These are the stale records left behind when on-disk state was
|
|
||||||
# inconsistent: the service entry had incident_id cleared but incidents.json
|
|
||||||
# still had the record as "active".
|
|
||||||
for inc_id, inc in self.world_state["incidents"].items():
|
|
||||||
if inc.get("status") != "active":
|
|
||||||
continue
|
|
||||||
if inc_id in linked_ids:
|
|
||||||
continue
|
|
||||||
age = now - _parse_ts(inc.get("last_occurrence"))
|
|
||||||
if age > 300: # 5-minute guard
|
|
||||||
logger.info(
|
|
||||||
f"Auto-resolving orphaned incident {inc_id} "
|
|
||||||
f"(service={inc.get('service')}, node={inc.get('node')}): "
|
|
||||||
f"no service references it, age={int(age)}s"
|
|
||||||
)
|
|
||||||
inc["status"] = "resolved"
|
|
||||||
inc["resolved_at"] = now
|
|
||||||
|
|
||||||
except Exception as exc:
|
|
||||||
logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
|
|
||||||
|
|
||||||
# Remove resolved incidents older than 7 days.
|
|
||||||
# Use _parse_ts so ISO-string resolved_at values are handled correctly.
|
|
||||||
stale_incidents = [
|
|
||||||
k for k, v in self.world_state["incidents"].items()
|
|
||||||
if v.get("status") == "resolved"
|
|
||||||
and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
|
|
||||||
]
|
|
||||||
for k in stale_incidents:
|
|
||||||
del self.world_state["incidents"][k]
|
|
||||||
|
|
||||||
def _save_world(self):
|
|
||||||
self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
|
|
||||||
active_incidents = [
|
|
||||||
k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
|
|
||||||
]
|
|
||||||
self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
|
|
||||||
self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
|
|
||||||
self.world_state["summary"]["service_count"] = len(self.world_state["services"])
|
|
||||||
|
|
||||||
if active_incidents:
|
|
||||||
self.world_state["summary"]["status"] = "degraded"
|
|
||||||
else:
|
|
||||||
self.world_state["summary"]["status"] = "nominal"
|
|
||||||
|
|
||||||
files = {
|
|
||||||
"nodes.json": self.world_state["nodes"],
|
|
||||||
"services.json": self.world_state["services"],
|
|
||||||
"deployments.json": self.world_state["deployments"],
|
|
||||||
"incidents.json": self.world_state["incidents"],
|
|
||||||
"recommendations.json": [],
|
|
||||||
"runtime-summary.json": self.world_state["summary"]
|
|
||||||
}
|
|
||||||
for filename, data in files.items():
|
|
||||||
try:
|
|
||||||
_atomic_write_json(WORLD_DIR / filename, data)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to save {filename}: {e}")
|
|
||||||
|
|
||||||
def process_event(self, event):
|
|
||||||
etype = event.get("type")
|
|
||||||
node = event.get("node")
|
|
||||||
service = event.get("service")
|
|
||||||
severity = event.get("severity")
|
|
||||||
timestamp = event.get("timestamp")
|
|
||||||
cid = event.get("correlation_id")
|
|
||||||
payload = event.get("payload", {})
|
|
||||||
|
|
||||||
# 1. Update Node State
|
|
||||||
if node not in self.world_state["nodes"]:
|
|
||||||
self.world_state["nodes"][node] = {
|
|
||||||
"status": "unknown",
|
|
||||||
"last_seen": None,
|
|
||||||
"roles": self.inventory["nodes"].get(node, {}).get("roles", [])
|
|
||||||
}
|
|
||||||
self.world_state["nodes"][node]["last_seen"] = timestamp
|
|
||||||
|
|
||||||
if etype == "node_online":
|
|
||||||
self.world_state["nodes"][node]["status"] = "online"
|
|
||||||
elif etype == "node_offline":
|
|
||||||
self.world_state["nodes"][node]["status"] = "offline"
|
|
||||||
|
|
||||||
elif etype == "node_health":
|
|
||||||
# Regular heartbeat from node-agent; updates resource metrics.
|
|
||||||
# Clears disk_pressure if disk is now healthy (< warn threshold).
|
|
||||||
self.world_state["nodes"][node]["status"] = "online"
|
|
||||||
self.world_state["nodes"][node].update({
|
|
||||||
"disk_usage_pct": payload.get("disk_pct"),
|
|
||||||
"mem_usage_pct": payload.get("mem_pct"),
|
|
||||||
"cpu_usage_pct": payload.get("cpu_pct"),
|
|
||||||
})
|
|
||||||
if (payload.get("disk_pct") or 0) < 75:
|
|
||||||
self.world_state["nodes"][node].pop("disk_pressure", None)
|
|
||||||
|
|
||||||
elif etype == "disk_pressure":
|
|
||||||
# Emitted when disk usage crosses 75 % (medium) or 85 % (high).
|
|
||||||
# The supervisor reads disk_pressure to generate disk_cleanup actions.
|
|
||||||
self.world_state["nodes"][node]["disk_pressure"] = severity
|
|
||||||
self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
|
|
||||||
|
|
||||||
elif etype == "high_memory":
|
|
||||||
# Memory pressure observation; recorded on the node for correlation.
|
|
||||||
# No automated action — operator decides if a container restart helps.
|
|
||||||
self.world_state["nodes"][node]["memory_pressure"] = severity
|
|
||||||
self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
|
|
||||||
|
|
||||||
elif etype == "high_cpu":
|
|
||||||
# CPU pressure observation; recorded for visibility.
|
|
||||||
self.world_state["nodes"][node]["cpu_pressure"] = severity
|
|
||||||
self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")
|
|
||||||
|
|
||||||
# 2. Update Service State
|
|
||||||
if service and service != "all":
|
|
||||||
svc_key = f"{node}/{service}"
|
|
||||||
if svc_key not in self.world_state["services"]:
|
|
||||||
self.world_state["services"][svc_key] = {
|
|
||||||
"node": node,
|
|
||||||
"service": service,
|
|
||||||
"status": "unknown",
|
|
||||||
"last_check": None,
|
|
||||||
"incident_id": None
|
|
||||||
}
|
|
||||||
self.world_state["services"][svc_key]["last_check"] = timestamp
|
|
||||||
|
|
||||||
if etype == "service_recovered":
|
|
||||||
self.world_state["services"][svc_key]["status"] = "healthy"
|
|
||||||
self._resolve_incident(svc_key, timestamp)
|
|
||||||
elif etype == "service_healthy":
|
|
||||||
# Positive confirmation from node-agent that a managed container
|
|
||||||
# is running. This keeps services.json populated so the supervisor
|
|
||||||
# can correctly detect drift (absent entry = never reported = unknown,
|
|
||||||
# not the same as confirmed missing).
|
|
||||||
# Also resolve any active incident — if a service that had been
|
|
||||||
# unhealthy/crashing is now confirmed healthy, the incident is over.
|
|
||||||
self.world_state["services"][svc_key]["status"] = "healthy"
|
|
||||||
self._resolve_incident(svc_key, timestamp)
|
|
||||||
elif etype in ["service_unhealthy", "healthcheck_failed"]:
|
|
||||||
self.world_state["services"][svc_key]["status"] = "unhealthy"
|
|
||||||
self._handle_incident(svc_key, event)
|
|
||||||
|
|
||||||
# 3. Update Deployment State
|
|
||||||
if etype and etype.startswith("deployment_") and cid:
|
|
||||||
if cid not in self.world_state["deployments"]:
|
|
||||||
self.world_state["deployments"][cid] = {
|
|
||||||
"node": node,
|
|
||||||
"service": service,
|
|
||||||
"status": "unknown",
|
|
||||||
"started_at": None,
|
|
||||||
"finished_at": None,
|
|
||||||
"events": []
|
|
||||||
}
|
|
||||||
self.world_state["deployments"][cid]["events"].append({
|
|
||||||
"type": etype,
|
|
||||||
"timestamp": timestamp,
|
|
||||||
"payload": payload
|
|
||||||
})
|
|
||||||
if etype == "deployment_started":
|
|
||||||
self.world_state["deployments"][cid]["status"] = "in_progress"
|
|
||||||
self.world_state["deployments"][cid]["started_at"] = timestamp
|
|
||||||
elif etype == "deployment_completed":
|
|
||||||
self.world_state["deployments"][cid]["status"] = "completed"
|
|
||||||
self.world_state["deployments"][cid]["finished_at"] = timestamp
|
|
||||||
elif etype == "deployment_failed":
|
|
||||||
self.world_state["deployments"][cid]["status"] = "failed"
|
|
||||||
self.world_state["deployments"][cid]["finished_at"] = timestamp
|
|
||||||
# Deployment failure often creates an incident
|
|
||||||
self._handle_deployment_failure(event)
|
|
||||||
|
|
||||||
    def _handle_incident(self, svc_key, event):
        # Correlation: collapse repeated failures for the same service on the same node
        active_incident = self.world_state["services"][svc_key].get("incident_id")

        if active_incident and active_incident in self.world_state["incidents"]:
            incident = self.world_state["incidents"][active_incident]
            if incident["status"] == "active":
                incident["last_occurrence"] = event["timestamp"]
                incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
                incident["events"].append(event["timestamp"])
                return

        # Create new incident
        incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
        self.world_state["incidents"][incident_id] = {
            "id": incident_id,
            "node": event.get("node"),
            "service": event.get("service"),
            "status": "active",
            "severity": event.get("severity"),
            # trigger_type records the event type that opened this incident so that
            # the supervisor can choose the appropriate remediation action
            # (e.g. container_restart for containers_not_running / mqtt_unreachable
            # vs. a full redeploy for other causes).
            "trigger_type": event.get("type"),
            "started_at": event.get("timestamp"),
            "last_occurrence": event.get("timestamp"),
            "occurrence_count": 1,
            "events": [event["timestamp"]],
            "correlation_id": event.get("correlation_id")
        }
        self.world_state["services"][svc_key]["incident_id"] = incident_id
|
|
||||||
|
|
||||||
def _resolve_incident(self, svc_key, timestamp):
|
|
||||||
incident_id = self.world_state["services"][svc_key].get("incident_id")
|
|
||||||
if incident_id and incident_id in self.world_state["incidents"]:
|
|
||||||
if self.world_state["incidents"][incident_id]["status"] == "active":
|
|
||||||
self.world_state["incidents"][incident_id]["status"] = "resolved"
|
|
||||||
self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
|
|
||||||
self.world_state["services"][svc_key]["incident_id"] = None
|
|
||||||
|
|
||||||
    def _handle_deployment_failure(self, event):
        # Specific logic for deployment failures
        svc_key = f"{event.get('node')}/{event.get('service')}"
        if svc_key not in self.world_state["services"]:
            # process_event only creates service entries for concrete services
            # (not None or "all"). Without an entry there is nothing to attach
            # an incident to, and _handle_incident would raise KeyError.
            logger.warning(f"Deployment failure for untracked service {svc_key}; no incident opened")
            return
        self._handle_incident(svc_key, event)

        # Link diagnostics if available in payload
        incident_id = self.world_state["services"][svc_key].get("incident_id")
        if incident_id and incident_id in self.world_state["incidents"]:
            payload = event.get("payload", {})
            if "diagnostics_file" in payload:
                self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
            elif "error" in payload:
                self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
|
|
||||||
|
|
||||||
    def run_once(self):
        # Update heartbeat
        heartbeat_file = STATE_DIR / "observer.heartbeat"
        try:
            heartbeat_file.touch()
        except Exception as e:
            logger.error(f"Failed to touch heartbeat file: {e}")

        # Collect all event files grouped by node directory.
        # Per-node checkpoints are compared within each directory independently,
        # so late-arriving events from remote nodes (sorted earlier in the path)
        # are never skipped just because another node's checkpoint is further ahead.
        all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))

        new_files = []
        for file_path in all_files:
            try:
                node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
            except (IndexError, ValueError):
                node_dir = "__unknown__"
            last_for_node = self.node_checkpoints.get(node_dir, "")
            if file_path > last_for_node:
                new_files.append((node_dir, file_path))

        if not new_files:
            # Even if no new events, prune stale entries and refresh summary freshness.
            self._prune_stale_world()
            self._save_world()
            return

        logger.info(f"Processing {len(new_files)} new events across "
                    f"{len({n for n, _ in new_files})} node(s)")
        for node_dir, file_path in new_files:
            try:
                with open(file_path, "r") as f:
                    event = json.load(f)
                self.process_event(event)
                # Advance per-node checkpoint (only forward, no regression).
                if file_path > self.node_checkpoints.get(node_dir, ""):
                    self.node_checkpoints[node_dir] = file_path
            except Exception as e:
                logger.error(f"Error processing {file_path}: {e}")

        self._save_checkpoint()
        self._prune_stale_world()
        self._save_world()

    def loop(self, interval=5):
        logger.info("Starting observer loop")
        while True:
            self.run_once()
            time.sleep(interval)


if __name__ == "__main__":
    import sys
    observer = Observer()
    if "--run-once" in sys.argv:
        observer.run_once()
    else:
        observer.loop()
|
|
||||||
|
|
@ -1,83 +0,0 @@
#!/usr/bin/env bash
mkdir -p /tmp/homelab/events/2026-05-12/saturn
mkdir -p /tmp/homelab/state
mkdir -p /tmp/homelab/logs
mkdir -p /tmp/homelab/world

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
{
  "timestamp": "2026-05-12T12:00:00Z",
  "node": "saturn",
  "type": "node_online",
  "severity": "info",
  "source": "system",
  "service": "all",
  "correlation_id": "init",
  "payload": {}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
{
  "timestamp": "2026-05-12T12:05:00Z",
  "node": "saturn",
  "type": "service_unhealthy",
  "severity": "error",
  "source": "healthcheck",
  "service": "mosquitto",
  "correlation_id": "hc-1",
  "payload": {"error": "connection refused"}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
{
  "timestamp": "2026-05-12T12:06:00Z",
  "node": "saturn",
  "type": "service_unhealthy",
  "severity": "error",
  "source": "healthcheck",
  "service": "mosquitto",
  "correlation_id": "hc-2",
  "payload": {"error": "connection refused"}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
{
  "timestamp": "2026-05-12T12:10:00Z",
  "node": "saturn",
  "type": "service_recovered",
  "severity": "info",
  "source": "healthcheck",
  "service": "mosquitto",
  "correlation_id": "hc-3",
  "payload": {}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
{
  "timestamp": "2026-05-12T12:15:00Z",
  "node": "saturn",
  "type": "deployment_started",
  "severity": "info",
  "source": "deploy_agent",
  "service": "mosquitto",
  "correlation_id": "deploy-1",
  "payload": {"version": "2.0.18"}
}
EOF

cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
{
  "timestamp": "2026-05-12T12:16:00Z",
  "node": "saturn",
  "type": "deployment_failed",
  "severity": "error",
  "source": "deploy_agent",
  "service": "mosquitto",
  "correlation_id": "deploy-1",
  "payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
}
EOF
@ -1,55 +0,0 @@
### Agent System
Central runtime materializer and Operator Control Plane UI.

#### Components
- **Redis**: Central state store (on PIHA).
- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
- **Web UI**: Exposes API endpoints and serves the Operator UI.
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.

#### Configuration
Environment variables should be set in `.env` (see `env.example`).
Key variables for the Telegram Bot:
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
- `CONTROL_PLANE_URL`: URL of the `agent-system-webui` (default: `http://webui:8080`).

#### Telegram Commands
- `/status`: Check bot and API connectivity.
- `/summary`: System health overview.
- `/nodes`: List homelab nodes and their status.
- `/services`: Summary of services across nodes.
- `/unhealthy`: List all unhealthy components.
- `/incidents`: View active incidents.
- `/actions`: Summary of operator actions.
- `/help`: List all commands.

#### Deployment (on PIHA)
```bash
cd services/agent-system
./deploy.sh
```

#### Deployment (on CHELSTY)
```bash
cd services/stability-agent
docker compose up -d --build
```

#### Verification
The `deploy.sh` script automatically verifies the local endpoints.
You can also check them manually:
```bash
# Check runtime summary
curl http://localhost:18180/summary

# Check discovered nodes
curl http://localhost:18180/nodes

# Check discovered services
curl http://localhost:18180/services
```

#### Directory Structure
- `/opt/homelab/world`: Contains materialized JSON state.
- `/opt/homelab/state`: Contains operator configuration and local heartbeats.
@ -1,52 +0,0 @@
### Action Approval Data Model

Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.

#### Statuses
- `pending`: Waiting for operator approval. AI agents create actions in this state.
- `approved`: Approved by operator, ready for execution.
- `rejected`: Rejected by operator, will not be executed.
- `running`: Currently being executed by an agent (e.g. `materializer`).
- `completed`: Successfully executed.
- `failed`: Execution failed.

#### Human-in-the-Loop (HIL) Protocol
1. **Request**: Agent identifies a required change and writes a JSON to `actions/pending/`.
2. **Notification**: System notifies the human operator.
3. **Audit**: Human reviews `details.reason` and `details.diff`.
4. **Authorization**: Human moves file to `approved/`.
5. **Execution**: Agent monitors `approved/` and executes the task.

#### Schema
```json
{
  "action_id": "string",
  "service": "string",
  "node": "string",
  "type": "deploy_service | restart_service | rollback | scale",
  "risk": "nominal | guarded | critical",
  "status": "pending | approved | rejected | ...",
  "created_at": <unix_seconds>,
  "updated_at": <unix_seconds>,
  "details": {
    "image": "string",
    "reason": "string",
    "diff": "string"
  },
  "transition_history": [
    {
      "from": "string | null",
      "to": "string",
      "timestamp": <unix_seconds>,
      "by": "string (system | operator-tg-12345 | webui)"
    }
  ]
}
```
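
As a minimal sketch of the **Request** step, the snippet below builds a record in the shape of the schema above and writes it into the `pending/` directory. The helper names `new_pending_action` and `write_pending` are hypothetical and are not taken from the repository; the field names and the initial `null -> pending` transition follow the schema.

```python
import json
import time
from pathlib import Path


def new_pending_action(action_id, service, node, action_type, risk, details, by="system"):
    """Build an action record shaped like the schema (hypothetical helper)."""
    now = int(time.time())  # the schema asks for unix seconds
    return {
        "action_id": action_id,
        "service": service,
        "node": node,
        "type": action_type,
        "risk": risk,
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "details": details,
        "transition_history": [
            # The first transition has no previous status.
            {"from": None, "to": "pending", "timestamp": now, "by": by}
        ],
    }


def write_pending(actions_root, action):
    """Write the record to <actions_root>/pending/<action_id>.json and return the path."""
    pending_dir = Path(actions_root) / "pending"
    pending_dir.mkdir(parents=True, exist_ok=True)
    path = pending_dir / f"{action['action_id']}.json"
    path.write_text(json.dumps(action, indent=2))
    return path
```

Calling `write_pending("/opt/homelab/actions", new_pending_action(...))` would place the file where a watcher of `actions/pending/` can find it; the root path is passed in so the sketch can be tried against a scratch directory first.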

#### Workflow
1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
2. `telegram-bot` detects the file, sends a message to allowed users.
3. Operator clicks "Approve" or "Reject".
4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
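
One way the status moves in this workflow could be made atomic is sketched below, assuming all status directories live on the same filesystem so that a rename is atomic. The function name `transition_action` and its signature are hypothetical. The updated record is first swapped over the original via a temporary file, then the file is renamed into the target directory, so the action always sits in exactly one status directory and is never half-written.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def transition_action(actions_root, action_id, from_status, to_status, by):
    """Move an action between status directories and record the transition.

    Hypothetical helper: name and signature are assumptions for illustration.
    """
    root = Path(actions_root)
    source = root / from_status / f"{action_id}.json"
    target_dir = root / to_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name

    data = json.loads(source.read_text())
    now = int(time.time())
    data["status"] = to_status
    data["updated_at"] = now
    data.setdefault("transition_history", []).append(
        {"from": from_status, "to": to_status, "timestamp": now, "by": by}
    )

    # Write the updated record to a temp file next to the original, then swap
    # it in. The ".tmp" suffix keeps it out of any "*.json" directory scan.
    fd, tmp_name = tempfile.mkstemp(dir=source.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(data, tmp, indent=2)
    os.replace(tmp_name, source)

    # A rename within one filesystem is atomic.
    os.replace(source, target)
    return target
```

For example, `transition_action("/opt/homelab/actions", "test-1", "pending", "approved", "operator-tg-12345")` would cover step 4, and the same call with `approved -> running` and `running -> completed` would cover step 5.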
@ -1,28 +0,0 @@
#!/bin/bash
set -e

echo ">>> Validating docker-compose configuration..."
docker compose config

echo ">>> Building and starting Agent System services..."
docker compose up -d --build

echo ">>> Services status:"
docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
    echo ">>> Telegram bot status: DISABLED (token missing)"
else
    echo ">>> Telegram bot status: ENABLED"
fi

echo ">>> Verifying API endpoints..."
sleep 5 # Give it a moment to start

endpoints=("summary" "nodes" "services")
for ep in "${endpoints[@]}"; do
    echo "Checking /$ep..."
    curl -s -f "http://localhost:18180/$ep" > /dev/null && echo " OK" || echo " FAILED"
done

echo ">>> Deployment complete."
@ -1,47 +0,0 @@
services:
  redis:
    image: redis:7
    container_name: agent-system-redis
    ports:
      - "6379:6379"
    restart: unless-stopped

  webui:
    build: ./webui
    container_name: agent-system-webui
    ports:
      - "18180:8080"
    volumes:
      - /opt/homelab:/opt/homelab
    depends_on:
      - redis
    restart: unless-stopped

  runtime-materializer:
    build: ./runtime-materializer
    container_name: agent-system-runtime-materializer
    environment:
      REDIS_HOST: redis
      REDIS_PORT: "6379"
      HOMELAB_WORLD_ROOT: /opt/homelab/world
      WORLD_DIR: /opt/homelab/world
      MATERIALIZE_INTERVAL: "10"
    volumes:
      - /opt/homelab:/opt/homelab
    depends_on:
      - redis
    restart: unless-stopped

  telegram-bot:
    build: ./telegram-bot
    container_name: agent-system-telegram-bot
    environment:
      TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
      TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
      CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
      ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
      OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
      ACTIONS_ROOT: /opt/homelab/actions
    volumes:
      - /opt/homelab:/opt/homelab
    restart: on-failure
@ -1,19 +0,0 @@
# Telegram Bot Configuration
# Get token from @BotFather
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
# Comma-separated list of Telegram User IDs
TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
# Local control-plane API (default is internal compose address)
CONTROL_PLANE_URL=http://webui:8080
# Optional LLM fallback logic
ENABLE_LLM_FALLBACK=false
OPENCLAW_BASE_URL=http://openclaw.internal

# Runtime Materializer Configuration
REDIS_HOST=100.108.208.3
REDIS_PORT=6379

# Paths
HOMELAB_ROOT=/opt/homelab
ACTIONS_ROOT=/opt/homelab/actions
WORLD_DIR=/opt/homelab/world
@ -1,16 +0,0 @@
FROM python:3.11-slim

WORKDIR /app

# Install the redis Python client used by materializer.py
RUN pip install --no-cache-dir redis

COPY materializer.py .

# Ensure the world directory exists in the container (though it will likely be a volume)
RUN mkdir -p /opt/homelab/world

# Use unbuffered output to see logs in docker
ENV PYTHONUNBUFFERED=1

CMD ["python", "materializer.py"]
@ -1,251 +0,0 @@
import redis
import json
import os
import time
import argparse
import urllib.request
import urllib.error
from datetime import datetime

# Configuration from environment variables
REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")

# When set, materialize from the control-plane HTTP API instead of Redis.
# This is the authoritative source of truth: the observer writes clean world
# state to the control-plane API, which the materializer mirrors locally so
# the webui's /snapshot (and all other endpoints) reflect the same data.
#
# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")


def get_redis_client():
    """Returns a Redis client with decoding enabled."""
    return redis.Redis(
        host=REDIS_HOST,
        port=REDIS_PORT,
        decode_responses=True,
        socket_timeout=5
    )

def safe_json_loads(data, default=None):
    """Safely loads JSON from a string."""
    if not data:
        return default
    try:
        if isinstance(data, (dict, list)):
            return data
        return json.loads(data)
    except (json.JSONDecodeError, TypeError):
        return data

def normalize_health(health):
    """Normalizes health values for the UI."""
    if not health:
        return "nominal"
    h = str(health).lower()
    if h in ["healthy", "ok", "running", "nominal"]:
        return "nominal"
    if h in ["degraded", "warning"]:
        return "degraded"
    return "error"


def _fetch_json(url):
    """Fetch JSON from a URL, returning parsed data or None on error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.loads(resp.read())
    except Exception as e:
        print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
        return None


def write_json(filename, data):
    path = os.path.join(WORLD_DIR, filename)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


def materialize_from_api():
    """Mirror world state from the control-plane API to local world files.

    The control-plane observer on VPS is the single authoritative writer of
    world state. By fetching from its HTTP API we get the same clean, pruned
    data that the /summary endpoint serves, with no stale Redis artefacts.

    Returns True if all fetches succeeded and files were written, False otherwise.
    """
    print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")

    endpoints = {
        "nodes.json": f"{CONTROL_PLANE_URL}/nodes",
        "services.json": f"{CONTROL_PLANE_URL}/services",
        "incidents.json": f"{CONTROL_PLANE_URL}/incidents",
        "deployments.json": f"{CONTROL_PLANE_URL}/deployments",
        "recommendations.json": f"{CONTROL_PLANE_URL}/recommendations",
        "runtime-summary.json": f"{CONTROL_PLANE_URL}/summary",
        "events.json": f"{CONTROL_PLANE_URL}/events",
    }

    fetched = {}
    for filename, url in endpoints.items():
        data = _fetch_json(url)
        if data is None:
            print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
            return False
        fetched[filename] = data

    os.makedirs(WORLD_DIR, exist_ok=True)
    for filename, data in fetched.items():
        write_json(filename, data)

    svc_count = len(fetched.get("services.json") or [])
    print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
    return True


def materialize():
    """Reads state from Redis and writes JSON files to the world directory."""
    print(f"[{datetime.now().isoformat()}] Materializing world state...")
    try:
        r = get_redis_client()

        # 1. Nodes
        nodes = []
        node_keys = r.keys("homelab:nodes:*")
        for key in node_keys:
            node_data = r.hgetall(key)
            if node_data:
                # Normalize health
                if "health" in node_data:
                    node_data["health"] = normalize_health(node_data["health"])
                # Parse JSON fields if they exist
                if "capabilities" in node_data:
                    node_data["capabilities"] = safe_json_loads(node_data["capabilities"], [])
                if "checks" in node_data:
                    node_data["checks"] = safe_json_loads(node_data["checks"], {})
                nodes.append(node_data)

        # 2. Services
        services = []
        service_keys = r.keys("homelab:services:*")
        for key in service_keys:
            svc_data = r.hgetall(key)
            if svc_data:
                # Normalize health
                if "health" in svc_data:
                    svc_data["health"] = normalize_health(svc_data["health"])
                if "dependencies" in svc_data:
                    svc_data["dependencies"] = safe_json_loads(svc_data["dependencies"], [])
                if "recommendations" in svc_data:
                    svc_data["recommendations"] = safe_json_loads(svc_data["recommendations"], [])
                services.append(svc_data)

        # 3. Events (Stream)
        events = []
        try:
            # Get last 100 events from the stream
            raw_events = r.xrevrange("homelab:events", count=100)
            for event_id, data in raw_events:
                event = data.copy()
                event["id"] = event_id
                if "details" in event:
                    event["details"] = safe_json_loads(event["details"], {})
                events.append(event)
        except redis.exceptions.ResponseError:
            # homelab:events might not be a stream or doesn't exist
            pass

        # 4. Incidents (Hash)
        incidents = []
        incident_keys = r.keys("homelab:incidents:*")
        for key in incident_keys:
            incident_data = r.hgetall(key)
            if incident_data:
                # Normalize health if present
                if "health" in incident_data:
                    incident_data["health"] = normalize_health(incident_data["health"])
                incidents.append(incident_data)

        # 5. Deployments (Hash)
        deployments = []
        deployment_keys = r.keys("homelab:deployments:*")
        for key in deployment_keys:
            dep_data = r.hgetall(key)
            if dep_data:
                deployments.append(dep_data)

        # 6. Recommendations (Hash)
        recommendations = []
        recommendation_keys = r.keys("homelab:recommendations:*")
        for key in recommendation_keys:
            rec_data = r.hgetall(key)
            if rec_data:
                recommendations.append(rec_data)

        # 7. Runtime Summary
        unhealthy_services = [s for s in services if s.get("health") != "nominal"]
        active_incidents = [i for i in incidents if i.get("status") not in ["resolved", "closed"]]

        status = "nominal"
        if len(active_incidents) > 0 or len(unhealthy_services) > 5:
            status = "error"
        elif len(unhealthy_services) > 0:
            status = "degraded"

        summary = {
            "status": status,
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "last_update": int(time.time()),
            "node_count": len(nodes),
            "service_count": len(services),
            "active_incidents_count": len(active_incidents),
            "unhealthy_services_count": len(unhealthy_services),
            "incident_count": len(incidents),
            "recent_events_count": len(events),
            "stale": False
        }

        # Ensure directory exists
        os.makedirs(WORLD_DIR, exist_ok=True)

        write_json("runtime-summary.json", summary)
        write_json("nodes.json", nodes)
        write_json("services.json", services)
        write_json("incidents.json", incidents)
        write_json("events.json", events)
        write_json("deployments.json", deployments)
        write_json("recommendations.json", recommendations)

        print(f"[{datetime.now().isoformat()}] Successfully materialized to {WORLD_DIR}")

    except redis.exceptions.ConnectionError as e:
        print(f"Redis connection error: {e}")
    except Exception as e:
        print(f"Unexpected error during materialization: {e}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Homelab Runtime Materializer")
    parser.add_argument("--once", action="store_true", help="Run once and exit")
    parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
    args = parser.parse_args()

    if CONTROL_PLANE_URL:
        print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
        run_fn = materialize_from_api
    else:
        print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
        run_fn = materialize

    interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))

    if args.once:
        run_fn()
    else:
        print(f"Starting materializer loop (interval: {interval}s)...")
        while True:
            run_fn()
            time.sleep(interval)
@ -1,39 +0,0 @@
#!/bin/bash
# Script to create a test pending action for Telegram bot verification.

ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
mkdir -p "$ACTIONS_PENDING_DIR"

ACTION_ID="test-$(date +%s)"
FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"

TIMESTAMP=$(date +%s)

cat <<EOF > "$FILE_PATH"
{
  "action_id": "$ACTION_ID",
  "service": "frigate",
  "node": "chelsty",
  "type": "deploy_service",
  "risk": "guarded",
  "status": "pending",
  "created_at": $TIMESTAMP,
  "updated_at": $TIMESTAMP,
  "details": {
    "image": "blakeblackshear/frigate:0.13.0",
    "reason": "Security update for Frigate",
    "diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
  },
  "transition_history": [
    {
      "from": null,
      "to": "pending",
      "timestamp": $TIMESTAMP,
      "by": "system-test"
    }
  ]
}
EOF

echo "Test action created: $FILE_PATH"
echo "If the telegram-bot is running and configured, you should receive a notification."
@ -1,10 +0,0 @@
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY bot.py .

CMD ["python", "bot.py"]
@ -1,454 +0,0 @@
|
||||||
import os
|
|
||||||
import json
|
|
||||||
import time
|
|
||||||
import asyncio
|
|
||||||
import logging
|
|
||||||
import urllib.request
|
|
||||||
import urllib.error
|
|
||||||
from pathlib import Path
|
|
||||||
from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
|
|
||||||
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
|
|
||||||
|
|
||||||
# Setup logging
|
|
||||||
logging.basicConfig(
|
|
||||||
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
|
||||||
level=logging.INFO
|
|
||||||
)
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
|
||||||
# Configuration
|
|
||||||
TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
|
|
||||||
ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
|
|
||||||
ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
|
|
||||||
CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
|
|
||||||
ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
|
|
||||||
OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
|
|
||||||
|
|
||||||
async def fetch_api(path):
|
|
||||||
"""Helper to fetch JSON from the Control Plane API."""
|
|
||||||
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
|
|
||||||
try:
|
|
||||||
def do_request():
|
|
||||||
req = urllib.request.Request(url)
|
|
||||||
with urllib.request.urlopen(req, timeout=5) as response:
|
|
||||||
if response.status != 200:
|
|
||||||
return None
|
|
||||||
return json.loads(response.read().decode())
|
|
||||||
return await asyncio.to_thread(do_request)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error fetching {url}: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
async def post_api(path, data):
|
|
||||||
"""Helper to POST JSON to the Control Plane API."""
|
|
||||||
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
|
|
||||||
try:
|
|
||||||
body = json.dumps(data).encode("utf-8")
|
|
||||||
def do_request():
|
|
||||||
req = urllib.request.Request(url, data=body, method="POST")
|
|
||||||
req.add_header("Content-Type", "application/json")
|
|
||||||
with urllib.request.urlopen(req, timeout=5) as response:
|
|
||||||
return response.status == 200
|
|
||||||
return await asyncio.to_thread(do_request)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error posting to {url}: {e}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
def _format_pending_action(action_id: str, data: dict) -> str:
|
|
||||||
"""Build the Telegram Markdown message for a pending action notification.
|
|
||||||
|
|
||||||
Extracted so it can be unit-tested without a live Telegram connection.
|
|
||||||
"""
|
|
||||||
# Supervisor writes risk_level; action-model.md legacy schema used risk.
|
|
||||||
risk = data.get("risk_level") or data.get("risk", "unknown")
|
|
||||||
message = (
|
|
||||||
f"⚠️ *Pending Action*\n"
|
|
||||||
f"ID: `{action_id}`\n"
|
|
||||||
f"Type: `{data.get('type', 'unknown')}`\n"
|
|
||||||
f"Service: `{data.get('service', 'unknown')}`\n"
|
|
||||||
f"Node: `{data.get('node', 'unknown')}`\n"
|
|
||||||
f"Risk: *{risk}*\n"
|
|
||||||
)
|
|
||||||
# description carries the human-readable substance of the action (required for
|
|
||||||
# alert_only actions where it is the entire operator-visible message).
|
|
||||||
description = data.get("description", "")
|
|
||||||
if description:
|
|
||||||
truncated = description[:300] + ("..." if len(description) > 300 else "")
|
|
||||||
message += f"Description: `{truncated}`\n"
|
|
||||||
# Legacy details block (old action-model.md schema) — kept for backwards compat.
|
|
||||||
if "details" in data:
|
|
||||||
details_str = json.dumps(data["details"], indent=2)
|
|
||||||
if len(details_str) > 1000:
|
|
||||||
details_str = details_str[:1000] + "..."
|
|
||||||
message += f"\nDetails:\n```json\n{details_str}\n```"
|
|
||||||
return message
|
|
||||||
|
|
||||||
|
|
||||||
class ApprovalBot:
|
|
||||||
def __init__(self):
|
|
||||||
self.pending_dir = ACTIONS_ROOT / "pending"
|
|
||||||
self.approved_dir = ACTIONS_ROOT / "approved"
|
|
||||||
self.rejected_dir = ACTIONS_ROOT / "rejected"
|
|
||||||
# Track which action IDs we have already notified in this session to avoid spam
|
|
||||||
self.notified_actions = set()
|
|
||||||
|
|
||||||
async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
|
|
||||||
"""Job that periodically checks for new pending action files."""
|
|
||||||
if not self.pending_dir.exists():
|
|
||||||
return
|
|
||||||
|
|
||||||
try:
|
|
||||||
for action_file in self.pending_dir.glob("*.json"):
|
|
||||||
action_id = action_file.stem
|
|
||||||
if action_id in self.notified_actions:
|
|
||||||
continue
|
|
||||||
|
|
||||||
try:
|
|
||||||
data = json.loads(action_file.read_text())
|
|
||||||
# Only notify if it's truly pending
|
|
||||||
if data.get("status") == "pending":
|
|
||||||
await self.notify_users(context, action_id, data)
|
|
||||||
self.notified_actions.add(action_id)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error processing action file {action_file}: {e}")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error scanning pending directory: {e}")
|
|
||||||
|
|
||||||
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
|
|
||||||
"""Sends an approval request message to all allowed users."""
|
|
||||||
message = _format_pending_action(action_id, data)
|
|
||||||
|
|
||||||
keyboard = [
|
|
||||||
[
|
|
||||||
InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
|
|
||||||
InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
|
|
||||||
]
|
|
||||||
]
|
|
||||||
reply_markup = InlineKeyboardMarkup(keyboard)
|
|
||||||
|
|
||||||
for user_id in ALLOWED_IDS:
|
|
||||||
try:
|
|
||||||
await context.bot.send_message(
|
|
||||||
chat_id=user_id,
|
|
||||||
text=message,
|
|
||||||
parse_mode="Markdown",
|
|
||||||
reply_markup=reply_markup
|
|
||||||
)
|
|
||||||
logger.info(f"Notified user {user_id} about action {action_id}")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to notify user {user_id}: {e}")
|
|
||||||
|
|
||||||
async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
|
|
||||||
"""Handles button clicks for Approve/Reject."""
|
|
||||||
query = update.callback_query
|
|
||||||
user_id = query.from_user.id
|
|
||||||
|
|
||||||
if user_id not in ALLOWED_IDS:
|
|
||||||
await query.answer("Unauthorized", show_alert=True)
|
|
||||||
return
|
|
||||||
|
|
||||||
await query.answer()
|
|
||||||
|
|
||||||
cb_data = query.data
|
|
||||||
if ":" not in cb_data:
|
|
||||||
return
|
|
||||||
|
|
||||||
action, action_id = cb_data.split(":", 1)
|
|
||||||
target_status = "approved" if action == "approve" else "rejected"
|
|
||||||
|
|
||||||
# Use API for mutation if available, fallback to local disk move
|
|
||||||
success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
|
|
||||||
msg = "Success" if success else "API call failed"
|
|
||||||
|
|
||||||
if not success:
|
|
||||||
# Fallback to direct disk manipulation (original behavior)
|
|
||||||
success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))
|
|
||||||
|
|
||||||
if success:
|
|
||||||
status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
|
|
||||||
await query.edit_message_text(
|
|
||||||
text=query.message.text + f"\n\n{status_text} by {query.from_user.first_name}",
|
|
||||||
parse_mode="Markdown"
|
|
||||||
)
|
|
||||||
# Remove from notified list as it's no longer pending
|
|
||||||
if action_id in self.notified_actions:
|
|
||||||
self.notified_actions.remove(action_id)
|
|
||||||
else:
|
|
||||||
await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
|
|
||||||
|
|
||||||
    def move_action(self, action_id, target_status, user_id, username):
        """Moves action file and updates its status and history."""
        source_path = self.pending_dir / f"{action_id}.json"
        if not source_path.exists():
            return False, "Action file no longer exists in pending."

        target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
        target_dir.mkdir(parents=True, exist_ok=True)
        target_path = target_dir / f"{action_id}.json"

        try:
            data = json.loads(source_path.read_text())
            current_status = data.get("status", "pending")

            # Update data
            data["status"] = target_status
            data["updated_at"] = time.time()

            history = data.get("transition_history", [])
            history.append({
                "from": current_status,
                "to": target_status,
                "timestamp": time.time(),
                "by": f"tg:{username}"
            })
            data["transition_history"] = history

            # Move: write to the new location first, then delete the old file.
            # This is not atomic: a crash between the two steps leaves the action
            # in both directories.
            target_path.write_text(json.dumps(data, indent=2))
            source_path.unlink()
            logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
            return True, "Success"
        except Exception as e:
            logger.error(f"Error moving action file: {e}")
            return False, str(e)

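The write-then-unlink sequence in `move_action` can leave a half-written target file if the process dies mid-write. A minimal sketch of a safer variant follows, assuming the target directory is on a single local filesystem. The helper name `atomic_move_json` is hypothetical and not part of the original bot.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_move_json(source_path: Path, target_path: Path, data: dict) -> None:
    """Write `data` to target_path via a temp file + rename, then remove source_path.

    Hypothetical helper: readers of target_path see either no file or the
    complete file, never a partially written one.
    """
    target_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace renames in one step and overwrites an existing target.
        os.replace(tmp_name, target_path)
    except BaseException:
        # Remove the partial temp file before re-raising.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    # Only drop the pending file once the target is fully in place.
    source_path.unlink(missing_ok=True)
```

A crash before the rename leaves the action in the pending directory, and a crash after it leaves a complete copy in both places, so the action is never lost or truncated.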
async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Simple start command to help users find their ID."""
    user = update.effective_user
    message = (
        f"Hello {user.first_name}! 🤖\n"
        f"Your Telegram User ID is: `{user.id}`\n\n"
    )
    if user.id in ALLOWED_IDS:
        message += "✅ You are authorized to manage the homelab.\n\n"
        message += "Use /help to see available commands."
    else:
        message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."

    await update.message.reply_text(message, parse_mode="Markdown")

async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    res = await fetch_api("/summary")
    status = "✅ Online" if res else "❌ Unreachable"
    message = (
        f"🤖 *Telegram Bot Status*\n"
        f"Control Plane API: {status}\n"
        f"Target URL: `{CONTROL_PLANE_URL}`\n"
    )
    await update.message.reply_text(message, parse_mode="Markdown")

async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    data = await fetch_api("/summary")
    if not data:
        await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
        return

    msg = "📊 *System Summary*\n"
    msg += f"Status: `{data.get('status', 'unknown')}`\n"
    msg += f"Nodes: {data.get('node_count', 0)}\n"
    msg += f"Services: {data.get('service_count', 0)}\n"
    msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
    if data.get('stale'):
        msg += "\n⚠️ *Warning: Data is stale!*"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    nodes = await fetch_api("/nodes")
    if nodes is None:
        await update.message.reply_text("❌ Failed to fetch nodes.")
        return

    if not nodes:
        await update.message.reply_text("No nodes discovered in the fleet.")
        return

    msg = "🖥️ *Nodes Status*\n"
    for node in nodes:
        health_icon = "✅" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else "❌"
        msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
        msg += f"  Last seen: {node.get('last_seen', 'N/A')}\n"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    services = await fetch_api("/services")
    if services is None:
        await update.message.reply_text("❌ Failed to fetch services.")
        return

    # Summarize by node
    nodes = {}
    for s in services:
        node = s.get("node", "unknown")
        if node not in nodes:
            nodes[node] = []
        nodes[node].append(s)

    msg = "⚙️ *Services Summary*\n"
    if not nodes:
        msg += "No services discovered."
    else:
        for node, svc_list in sorted(nodes.items()):
            nominal = len([s for s in svc_list if s.get("health") == "nominal"])
            msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"

    msg += "\nUse /unhealthy to see issues."
    await update.message.reply_text(msg, parse_mode="Markdown")

async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    services = await fetch_api("/services")
    nodes = await fetch_api("/nodes")

    msg = "⚠️ *Unhealthy Components*\n"
    found = False

    if services:
        for s in services:
            # Tolerate a missing or null health field
            health = (s.get("health") or "").lower()
            if health != "nominal":
                msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
                found = True

    if nodes:
        for n in nodes:
            checks = n.get("checks", {})
            if isinstance(checks, str):
                # Checks may arrive as a JSON string; treat malformed JSON as "no checks"
                try:
                    checks = json.loads(checks)
                except json.JSONDecodeError:
                    checks = {}

            docker = checks.get("docker", {})
            if docker.get("status") == "ok":
                for c in docker.get("containers", []):
                    if c.get("state") != "running":
                        msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
                        found = True

    if not found:
        msg += "All systems nominal. ✅"

    await update.message.reply_text(msg, parse_mode="Markdown")

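The node loop in `unhealthy_command` accepts `checks` either as a dict or as a JSON string. A minimal sketch of that normalization as standalone helpers follows, assuming the payload shape used above (`checks.docker.status` and `checks.docker.containers[*].state`). The names `parse_checks` and `stopped_containers` are hypothetical and not part of the original bot.

```python
from __future__ import annotations

import json
from typing import Any


def parse_checks(raw: Any) -> dict:
    """Return node checks as a dict, tolerating JSON strings and malformed values."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return {}
    # Anything that is not a dict (None, list, number) counts as "no checks".
    return raw if isinstance(raw, dict) else {}


def stopped_containers(node: dict) -> list[tuple[str, str]]:
    """List (name, state) for every container on the node that is not running."""
    docker = parse_checks(node.get("checks")).get("docker", {})
    if not isinstance(docker, dict) or docker.get("status") != "ok":
        return []
    return [
        (c.get("name"), c.get("state"))
        for c in docker.get("containers", [])
        if c.get("state") != "running"
    ]
```

Keeping the parsing in a pure function like this makes it testable without a Telegram connection, in the same way the `_format_pending_action` tests are.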
async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    incidents = await fetch_api("/incidents")
    if incidents is None:
        await update.message.reply_text("❌ Failed to fetch incidents.")
        return

    active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
    if not active:
        await update.message.reply_text("No active incidents. ✅")
        return

    msg = "🚨 *Active Incidents*\n"
    for inc in active:
        severity = inc.get('severity', 'info').upper()
        msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS:
        return
    actions = await fetch_api("/actions")
    if actions is None:
        await update.message.reply_text("❌ Actions endpoint unavailable.")
        return

    msg = "⚡ *Actions Summary*\n"
    total = 0
    for status, act_list in actions.items():
        if act_list:
            msg += f"• {status.capitalize()}: {len(act_list)}\n"
            total += len(act_list)

    if total == 0:
        msg = "No actions recorded."

    await update.message.reply_text(msg, parse_mode="Markdown")

async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    msg = (
        "📖 *Supported Commands*\n\n"
        "/status - Check bot and API connectivity\n"
        "/summary - System health overview\n"
        "/nodes - List homelab nodes and their status\n"
        "/services - Summary of services across nodes\n"
        "/unhealthy - List all unhealthy components\n"
        "/incidents - View active incidents\n"
        "/actions - Summary of operator actions\n"
        "/help - Show this help message\n\n"
        "Free text will be handled by the guidance system."
    )
    await update.message.reply_text(msg, parse_mode="Markdown")

async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Handles non-command messages."""
    if update.effective_user.id not in ALLOWED_IDS:
        return

    if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
        # Placeholder for OpenClaw LLM fallback
        # In a real scenario, this would call the LLM API
        logger.info(f"LLM fallback requested for: {update.message.text}")

    await update.message.reply_text(
        "Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
    )

async def run_bot():
    if not TOKEN:
        print("CRITICAL: TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
        # Exit cleanly instead of raising, so a missing token does not crash the compose stack.
        # Requirement says: "do not fail if Telegram token is absent, but telegram-bot should be disabled or exit cleanly"
        return

    bot_logic = ApprovalBot()

    application = ApplicationBuilder().token(TOKEN).build()

    application.add_handler(CommandHandler("start", start_command))
    application.add_handler(CommandHandler("status", status_command))
    application.add_handler(CommandHandler("summary", summary_command))
    application.add_handler(CommandHandler("nodes", nodes_command))
    application.add_handler(CommandHandler("services", services_command))
    application.add_handler(CommandHandler("unhealthy", unhealthy_command))
    application.add_handler(CommandHandler("incidents", incidents_command))
    application.add_handler(CommandHandler("actions", actions_command))
    application.add_handler(CommandHandler("help", help_command))

    application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
    application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))

    # Schedule the pending actions check
    job_queue = application.job_queue
    if job_queue:
        job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
    else:
        logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")

    logger.info("Starting Telegram Approval Bot...")
    await application.initialize()
    await application.start()
    await application.updater.start_polling()

    # Run until the application is stopped
    stop_event = asyncio.Event()
    try:
        await stop_event.wait()
    except (KeyboardInterrupt, SystemExit):
        logger.info("Stopping bot...")
    finally:
        # Stop polling first; the application cannot shut down while the updater is running
        await application.updater.stop()
        await application.stop()
        await application.shutdown()

if __name__ == "__main__":
    try:
        asyncio.run(run_bot())
    except KeyboardInterrupt:
        pass
    except Exception as e:
        logger.error(f"Fatal error: {e}")

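`handle_callback` above maps every callback action other than `approve` to a rejection, so malformed button data would also reject the action. A minimal sketch of a stricter parser follows. The `reject` action name and the helper `parse_callback` are assumptions for illustration, because the code that builds the inline buttons is not shown in this section.

```python
from __future__ import annotations

# Hypothetical mapping: "approve" appears in handle_callback, "reject" is assumed.
STATUS_BY_ACTION = {"approve": "approved", "reject": "rejected"}


def parse_callback(cb_data: str) -> tuple[str, str] | None:
    """Return (target_status, action_id), or None for malformed or unknown data."""
    action, sep, action_id = cb_data.partition(":")
    if not sep or not action_id:
        return None  # no separator, or an empty action id
    status = STATUS_BY_ACTION.get(action)
    if status is None:
        return None  # unknown action: ignore rather than reject
    return status, action_id
```

With a helper like this, the handler could return early on `None` and only call `/action/mutate` for the two known button actions.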
@ -1 +0,0 @@
python-telegram-bot[job-queue]==20.7
@ -1,38 +0,0 @@
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
from __future__ import annotations

import sys
import types
from unittest.mock import MagicMock


def _make_telegram_stub() -> types.ModuleType:
    mod = types.ModuleType("telegram")
    mod.Update = MagicMock
    mod.InlineKeyboardButton = MagicMock
    mod.InlineKeyboardMarkup = MagicMock
    return mod


def _make_telegram_ext_stub() -> types.ModuleType:
    mod = types.ModuleType("telegram.ext")
    mod.ApplicationBuilder = MagicMock

    # ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
    # evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
    ContextTypesMock = MagicMock()
    ContextTypesMock.DEFAULT_TYPE = type(None)
    mod.ContextTypes = ContextTypesMock

    mod.CommandHandler = MagicMock
    mod.CallbackQueryHandler = MagicMock
    mod.MessageHandler = MagicMock
    mod.filters = MagicMock()
    return mod


# Insert before any import of bot.py
if "telegram" not in sys.modules:
    sys.modules["telegram"] = _make_telegram_stub()
if "telegram.ext" not in sys.modules:
    sys.modules["telegram.ext"] = _make_telegram_ext_stub()

@ -1,116 +0,0 @@
"""Tests for _format_pending_action — no Telegram connection required.

telegram stubs are set up in conftest.py before this module is imported.
"""
from __future__ import annotations

import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))
from bot import _format_pending_action


# ---------------------------------------------------------------------------
# Bug 1 — risk_level field
# ---------------------------------------------------------------------------

def test_risk_level_shown_when_present():
    data = {
        "type": "container_restart", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "low",
    }
    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
    assert "Risk: *low*" in msg
    assert "unknown" not in msg


def test_risk_falls_back_to_legacy_risk_key():
    data = {
        "type": "redeploy", "service": "mosquitto",
        "node": "chelsty-infra", "risk": "guarded",
    }
    msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
    assert "Risk: *guarded*" in msg


def test_risk_unknown_when_both_absent():
    data = {"type": "redeploy", "service": "foo", "node": "bar"}
    msg = _format_pending_action("redeploy-bar-foo", data)
    assert "Risk: *unknown*" in msg


# ---------------------------------------------------------------------------
# Bug 2 — description field
# ---------------------------------------------------------------------------

def test_description_shown_for_alert_only():
    data = {
        "type": "alert_only", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "info",
        "description": "3 entities unavailable for >1h",
    }
    msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
    assert "3 entities unavailable for >1h" in msg
    assert "Description:" in msg


def test_description_shown_for_container_restart():
    data = {
        "type": "container_restart", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "low",
        "description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
    }
    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
    assert "HA WebSocket unresponsive" in msg


def test_description_absent_no_crash():
    data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
    msg = _format_pending_action("redeploy-bar-foo", data)
    assert "Description:" not in msg
    assert "Risk: *guarded*" in msg


def test_description_truncated_at_300_chars():
    long_desc = "x" * 400
    data = {
        "type": "alert_only", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "info",
        "description": long_desc,
    }
    msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
    assert "x" * 300 in msg
    assert "..." in msg
    assert "x" * 301 not in msg


# ---------------------------------------------------------------------------
# Combined — real HA alert_only action shape
# ---------------------------------------------------------------------------

def test_ha_alert_only_full_action():
    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
    data = {
        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
        "type": "alert_only",
        "node": "chelsty-ha",
        "service": "homeassistant",
        "risk_level": "info",
        "confidence": 1.0,
        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
        "status": "pending",
        "payload": {
            "location_tag": "chelsty",
            "reason": "ha_entity_unavailable_long",
            "count": 3,
        },
    }
    msg = _format_pending_action(data["action_id"], data)
    assert "alert_only" in msg
    assert "chelsty-ha" in msg
    assert "Risk: *info*" in msg
    assert "3 entities unavailable" in msg
    assert "unknown" not in msg

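The tests above pin down the observable behavior of `_format_pending_action`: `risk_level` takes priority over the legacy `risk` key, the risk falls back to `unknown`, the description line is optional, and long descriptions are cut at 300 characters with a trailing `...`. A minimal sketch of a formatter that satisfies those assertions follows. It is not the implementation in `bot.py`, which is not shown in this section; the function name, field labels, and line layout are assumptions.

```python
from __future__ import annotations

MAX_DESCRIPTION_CHARS = 300  # truncation limit implied by the tests


def format_pending_action_sketch(action_id: str, data: dict) -> str:
    """Hypothetical formatter consistent with the test expectations above."""
    # Prefer the current key, then the legacy one, then a visible fallback.
    risk = data.get("risk_level") or data.get("risk") or "unknown"
    lines = [
        "*Pending action*",
        f"ID: `{action_id}`",
        f"Type: `{data.get('type', 'n/a')}`",
        f"Service: `{data.get('service', 'n/a')}`",
        f"Node: `{data.get('node', 'n/a')}`",
        f"Risk: *{risk}*",
    ]
    description = data.get("description")
    if description:
        if len(description) > MAX_DESCRIPTION_CHARS:
            description = description[:MAX_DESCRIPTION_CHARS] + "..."
        lines.append(f"Description: {description}")
    return "\n".join(lines)
```

The other fields default to `n/a` rather than `unknown` so that the word `unknown` only appears when the risk itself is missing, which is what `test_risk_level_shown_when_present` checks.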
@ -1,7 +0,0 @@
FROM python:3.11-slim

WORKDIR /app
COPY web.py index.html ./

EXPOSE 8080
CMD ["python", "web.py"]

@ -1,769 +0,0 @@
<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Operator Control Plane</title>
    <style>
        :root {
            --bg-color: #0a0c0e;
            --sidebar-color: #14171a;
            --card-color: #1c2024;
            --border-color: #2a3540;
            --text-color: #e7edf3;
            --text-muted: #94a3b8;
            --accent-color: #3eaf7c;
            --nominal: #3eaf7c;
            --degraded: #e7c000;
            --unstable: #e67e22;
            --reconciling: #3498db;
            --error: #c0392b;
            --safe: #3eaf7c;
            --guarded: #e67e22;
            --dangerous: #c0392b;
        }

        body {
            margin: 0;
            font-family: 'Inter', system-ui, -apple-system, sans-serif;
            background: var(--bg-color);
            color: var(--text-color);
            display: flex;
            height: 100vh;
            overflow: hidden;
        }

        /* Sidebar */
        .sidebar {
            width: 240px;
            background: var(--sidebar-color);
            border-right: 1px solid var(--border-color);
            display: flex;
            flex-direction: column;
            flex-shrink: 0;
        }

        .sidebar-header {
            padding: 24px;
            font-weight: 800;
            font-size: 14px;
            letter-spacing: 0.1em;
            color: var(--accent-color);
            border-bottom: 1px solid var(--border-color);
        }

        .nav-list {
            list-style: none;
            padding: 12px 0;
            margin: 0;
            flex-grow: 1;
        }

        .nav-item {
            padding: 12px 24px;
            cursor: pointer;
            font-size: 14px;
            color: var(--text-muted);
            transition: all 0.2s;
            display: flex;
            align-items: center;
            gap: 12px;
        }

        .nav-item:hover {
            background: rgba(255, 255, 255, 0.05);
            color: var(--text-color);
        }

        .nav-item.active {
            background: rgba(62, 175, 124, 0.1);
            color: var(--accent-color);
            border-left: 3px solid var(--accent-color);
        }

        .sidebar-footer {
            padding: 16px;
            border-top: 1px solid var(--border-color);
            font-size: 12px;
        }

        /* Content Area */
        .main-content {
            flex-grow: 1;
            display: flex;
            flex-direction: column;
            overflow: hidden;
        }

        header {
            height: 64px;
            border-bottom: 1px solid var(--border-color);
            display: flex;
            align-items: center;
            padding: 0 24px;
            justify-content: space-between;
            background: var(--bg-color);
        }

        .view-title {
            font-size: 18px;
            font-weight: 600;
        }

        .content-scroll {
            flex-grow: 1;
            overflow-y: auto;
            padding: 24px;
        }

        /* Cards & Grids */
        .grid {
            display: grid;
            grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
            gap: 20px;
        }

        .card {
            background: var(--card-color);
            border: 1px solid var(--border-color);
            padding: 20px;
            border-radius: 4px;
            position: relative;
        }

        .card-header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            margin-bottom: 16px;
        }

        .card-title {
            font-weight: 700;
            font-size: 16px;
        }

        /* Status Badges */
        .badge {
            padding: 4px 8px;
            border-radius: 4px;
            font-size: 11px;
            font-weight: 700;
            text-transform: uppercase;
        }

        .status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
        .status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
        .status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
        .status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
        .status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }

        /* Timeline */
        .timeline {
            display: flex;
            flex-direction: column;
            gap: 12px;
        }

        .event {
            padding: 12px;
            border-left: 2px solid var(--border-color);
            background: rgba(255, 255, 255, 0.02);
            font-family: ui-monospace, monospace;
            font-size: 13px;
        }

        .event.high { border-left-color: var(--error); }
        .event.medium { border-left-color: var(--unstable); }
        .event.low { border-left-color: var(--nominal); }

        .event-header {
            display: flex;
            justify-content: space-between;
            margin-bottom: 4px;
            color: var(--text-muted);
        }

        /* Forms & Inputs */
        .controls {
            display: flex;
            gap: 12px;
            margin-top: 20px;
        }

        input, button {
            background: var(--card-color);
            border: 1px solid var(--border-color);
            color: var(--text-color);
            padding: 8px 16px;
            font-size: 14px;
            border-radius: 4px;
        }

        button {
            cursor: pointer;
            font-weight: 600;
        }

        button:hover { background: var(--border-color); }

        .btn-primary { background: var(--accent-color); color: white; border: none; }
        .btn-primary:hover { background: #359b6d; }

        /* Utility */
        .hidden { display: none !important; }
        .mono { font-family: ui-monospace, monospace; }
        .label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
        .value { font-weight: 500; margin-bottom: 12px; }

        .risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
        .risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
        .risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }

    </style>
</head>
<body>
    <aside class="sidebar">
        <div class="sidebar-header">HOMELAB OPERATOR</div>
        <ul class="nav-list">
            <li class="nav-item active" onclick="showView('dashboard', this)">
                <span>Dashboard</span>
            </li>
            <li class="nav-item" onclick="showView('actions', this)">
                <span>Action Queue</span>
            </li>
            <li class="nav-item" onclick="showView('nodes', this)">
                <span>Nodes</span>
            </li>
            <li class="nav-item" onclick="showView('services', this)">
                <span>Services</span>
            </li>
            <li class="nav-item" onclick="showView('deployments', this)">
                <span>Deployments</span>
            </li>
            <li class="nav-item" onclick="showView('topology', this)">
                <span>Topology</span>
            </li>
            <li class="nav-item" onclick="showView('events', this)">
                <span>Events</span>
            </li>
            <li class="nav-item" onclick="showView('correlation', this)">
                <span>Correlation</span>
            </li>
            <li class="nav-item" onclick="showView('recommendations', this)">
                <span>Recommendations</span>
            </li>
            <li class="nav-item" onclick="showView('settings', this)">
                <span>Settings</span>
            </li>
        </ul>
        <div class="sidebar-footer">
            <div id="summary-status">System Status: Loading...</div>
        </div>
    </aside>

    <main class="main-content">
        <div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
            RUNTIME STATE IS STALE
        </div>
        <header>
            <div style="display:flex; align-items:center; gap:20px">
                <div class="view-title" id="current-view-title">Dashboard</div>
                <select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
                    <option value="observe">OBSERVE</option>
                    <option value="recommend">RECOMMEND</option>
                    <option value="approval" selected>APPROVAL</option>
                    <option value="autonomous">AUTONOMOUS</option>
                    <option value="maintenance">MAINTENANCE</option>
                </select>
            </div>
            <div class="header-actions" style="display:flex; gap:8px; align-items:center">
                <button onclick="refreshData()">Refresh</button>
                <button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
            </div>
        </header>

        <div class="content-scroll">
            <!-- Dashboard View -->
            <div id="view-dashboard" class="view">
                <div class="grid">
                    <div class="card">
                        <div class="card-title">System Overview</div>
                        <div id="dashboard-summary" style="margin-top:20px"></div>
                    </div>
                    <div class="card">
                        <div class="card-title">Pending Actions</div>
                        <div id="dashboard-actions-summary" style="margin-top:20px"></div>
                    </div>
                    <div class="card">
                        <div class="card-title">Active Incidents</div>
                        <div id="dashboard-incidents" style="margin-top:20px"></div>
                    </div>
                </div>
            </div>

            <!-- Actions View -->
            <div id="view-actions" class="view hidden">
                <div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
                    <div>
                        <h3>Pending Approval</h3>
                        <div id="actions-pending" class="timeline"></div>
                    </div>
                    <div>
                        <h3>Active / History</h3>
                        <div id="actions-history" class="timeline"></div>
                    </div>
                </div>
            </div>

            <!-- Nodes View -->
            <div id="view-nodes" class="view hidden">
                <div class="grid" id="nodes-list"></div>
            </div>

            <!-- Services View -->
            <div id="view-services" class="view hidden">
                <div class="grid" id="services-list"></div>
            </div>

            <!-- Deployments View -->
            <div id="view-deployments" class="view hidden">
                <div class="grid" id="deployments-list"></div>
            </div>

            <!-- Topology View -->
            <div id="view-topology" class="view hidden">
                <div class="card" style="min-height:500px">
                    <div class="card-title">Runtime Topology</div>
                    <div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
                </div>
            </div>

            <!-- Events View -->
            <div id="view-events" class="view hidden">
                <div class="timeline" id="events-timeline"></div>
            </div>

            <!-- Correlation View -->
            <div id="view-correlation" class="view hidden">
                <div id="correlation-chains" class="grid"></div>
            </div>

            <!-- Recommendations View -->
            <div id="view-recommendations" class="view hidden">
                <div class="grid" id="recommendations-list"></div>
            </div>

            <!-- Settings View -->
            <div id="view-settings" class="view hidden">
                <div class="card">
                    <div class="card-title">Configuration</div>
                    <div id="settings-content" style="margin-top:20px"></div>
                </div>
            </div>
        </div>
    </main>

  <script>
    let currentView = 'dashboard';
    const pollInterval = 5000;

    function showView(viewId, el) {
      document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
      document.getElementById('view-' + viewId).classList.remove('hidden');
      document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
      if (el) el.classList.add('active');
      currentView = viewId;
      document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
      refreshData();
    }

    async function fetchData(endpoint) {
      try {
        const res = await fetch(endpoint, {cache: 'no-store'});
        // Treat HTTP error responses as failures instead of parsing an error page as JSON.
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.json();
      } catch (e) {
        console.error('Fetch error:', endpoint, e);
        return null;
      }
    }

    async function postData(endpoint, data) {
      try {
        const res = await fetch(endpoint, {
          method: 'POST',
          headers: {'Content-Type': 'application/json'},
          body: JSON.stringify(data)
        });
        // Same rule as fetchData: a non-2xx status is reported as a failure.
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.json();
      } catch (e) {
        console.error('Post error:', endpoint, e);
        return null;
      }
    }

    async function mutateAction(id, status) {
      const res = await postData('/action/mutate', {id, status});
      if (res && res.status === 'ok') {
        refreshData();
      } else {
        alert('Mutation failed');
      }
    }

    async function setOperatorMode(mode) {
      console.log('Setting operator mode to:', mode);
      const res = await postData('/mode', {mode});
      if (res && res.status === 'ok') {
        console.log('Mode updated successfully');
      } else {
        console.error('Mode update failed');
      }
    }

    function formatTime(ts) {
      if (!ts) return 'N/A';
      return new Date(ts * 1000).toLocaleString();
    }

    function getStatusClass(status) {
      status = (status || '').toLowerCase();
      if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
      if (['degraded', 'warning'].includes(status)) return 'status-degraded';
      if (['unstable'].includes(status)) return 'status-unstable';
      if (['reconciling'].includes(status)) return 'status-reconciling';
      if (['error', 'down', 'failed'].includes(status)) return 'status-error';
      return '';
    }

    async function refreshData() {
      // Refresh summary always
      const summary = await fetchData('/summary');
      if (summary) {
        const statusEl = document.getElementById('summary-status');
        // /summary returns {} when no runtime summary exists yet, so status may be missing.
        const status = summary.status || 'unknown';
        statusEl.textContent = `System Status: ${status.toUpperCase()}`;
        statusEl.className = 'sidebar-footer ' + getStatusClass(status);

        // Handle stale state
        const staleBanner = document.getElementById('stale-banner');
        if (summary.stale) {
          staleBanner.classList.remove('hidden');
          staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
        } else {
          staleBanner.classList.add('hidden');
        }

        if (currentView === 'dashboard') {
          const dashSummary = document.getElementById('dashboard-summary');
          dashSummary.innerHTML = `
            <div class="label">Nodes</div><div class="value">${summary.node_count}</div>
            <div class="label">Services</div><div class="value">${summary.service_count}</div>
            <div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
          `;
        }
      }

      if (currentView === 'dashboard' || currentView === 'actions') {
        const actions = await fetchData('/actions');
        if (actions) {
          if (currentView === 'dashboard') {
            const dashActions = document.getElementById('dashboard-actions-summary');
            const pendingCount = actions.pending.length;
            dashActions.innerHTML = `
              <div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
              <div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
            `;
          }
          if (currentView === 'actions') {
            const pendingEl = document.getElementById('actions-pending');
            const historyEl = document.getElementById('actions-history');

            pendingEl.innerHTML = actions.pending.map(a => `
              <div class="card" style="margin-bottom:12px">
                <div class="card-header">
                  <div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
                  <span class="badge risk-${a.risk_level}">${a.risk_level}</span>
                </div>
                <p>${a.description || a.action_type || 'No description'}</p>
                <div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
                <div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
                <div class="controls">
                  <button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
                  <button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
                </div>
              </div>
            `).join('') || 'No pending actions.';

            const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
            historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
              <div class="event">
                <div class="event-header">
                  <span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
                  <span class="badge ${getStatusClass(a.status)}">${a.status}</span>
                </div>
                <div>${a.description || a.action_type || 'No description'}</div>
                <small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
                ${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
                ${a.transition_history ? `
                  <div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
                    <strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
                  </div>
                ` : ''}
              </div>
            `).join('') || 'No history.';
          }
        }
      }

      if (currentView === 'dashboard' || currentView === 'events') {
        const incidents = await fetchData('/incidents');
        if (currentView === 'dashboard') {
          const dashIncidents = document.getElementById('dashboard-incidents');
          if (!incidents || incidents.length === 0) {
            dashIncidents.textContent = 'No active incidents.';
          } else {
            dashIncidents.innerHTML = incidents.map(inc => `
              <div class="event ${inc.severity}">
                <strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
                <small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
              </div>
            `).join('');
          }
        }
      }

      if (currentView === 'nodes') {
        const nodes = await fetchData('/nodes');
        const list = document.getElementById('nodes-list');
        // fetchData returns null on failure; fall back to an empty list instead of throwing.
        list.innerHTML = (nodes || []).map(node => `
          <div class="card">
            <div class="card-header">
              <div class="card-title">${node.hostname}</div>
              <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
            </div>
            <div class="label">ID</div><div class="value mono">${node.id}</div>
            <div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
            <div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
            <div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
            <div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
            <div class="label">Runtime Status</div><div class="value">${node.status}</div>
          </div>
        `).join('');
      }

      if (currentView === 'services') {
        const services = await fetchData('/services');
        const list = document.getElementById('services-list');
        // fetchData returns null on failure; fall back to an empty list instead of throwing.
        list.innerHTML = (services || []).map(svc => `
          <div class="card">
            <div class="card-header">
              <div class="card-title">${svc.name}</div>
              <span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
            </div>
            <div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
            <div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
            <div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
            <div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
          </div>
        `).join('');
      }

      if (currentView === 'deployments') {
        const deps = await fetchData('/deployments');
        const list = document.getElementById('deployments-list');
        // fetchData returns null on failure; fall back to an empty list instead of throwing.
        list.innerHTML = (deps || []).map(dep => `
          <div class="card">
            <div class="card-header">
              <div class="card-title">${dep.service}</div>
              <span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
            </div>
            <div class="label">ID</div><div class="value mono">${dep.id}</div>
            <div class="label">Stage</div><div class="value">${dep.stage}</div>
            <div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
            <div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
            ${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
          </div>
        `).join('');
      }

      if (currentView === 'events') {
        const events = await fetchData('/events');
        const timeline = document.getElementById('events-timeline');
        // fetchData returns null on failure; fall back to an empty list instead of throwing.
        timeline.innerHTML = (events || []).map(ev => `
          <div class="event ${ev.severity}">
            <div class="event-header">
              <span>${ev.type.toUpperCase()}</span>
              <span>${formatTime(ev.timestamp)}</span>
            </div>
            <div>${ev.message}</div>
            <div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
          </div>
        `).join('');
      }

      if (currentView === 'recommendations') {
        const recs = await fetchData('/recommendations');
        const list = document.getElementById('recommendations-list');
        // fetchData returns null on failure; fall back to an empty list instead of throwing.
        list.innerHTML = (recs || []).map(rec => `
          <div class="card">
            <div class="card-header">
              <div class="card-title">${rec.title}</div>
              <span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
            </div>
            <p>${rec.description}</p>
            <div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
            <div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
            <div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
            <div class="controls">
              <button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
            </div>
          </div>
        `).join('');
      }

      if (currentView === 'topology') {
        const nodes = await fetchData('/nodes');
        const services = await fetchData('/services');
        const topMap = document.getElementById('topology-map');
        if (nodes && services) {
          topMap.innerHTML = nodes.map(node => {
            const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
            return `
              <div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
                <div class="card-header">
                  <div class="card-title">${node.hostname}</div>
                  <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
                </div>
                <div class="label">Capabilities</div>
                <div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
                <div class="label">Services</div>
                <div style="font-size:12px; margin-bottom:10px">
                  ${nodeServices.length > 0 ? nodeServices.map(s => `
                    <div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
                      <span>${s.name}</span>
                      <span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
                    </div>
                    ${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
                  `).join('') : '<div class="value">None</div>'}
                </div>
              </div>
            `;
          }).join('');
        }
      }

      if (currentView === 'correlation') {
        const incidents = await fetchData('/incidents');
        const actions = await fetchData('/actions');
        const list = document.getElementById('correlation-chains');
        if (incidents && actions) {
          const allActions = Object.values(actions).flat();
          list.innerHTML = incidents.map(inc => {
            const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
            return `
              <div class="card">
                <div class="card-header">
                  <div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
                  <span class="badge status-error">Active</span>
                </div>
                <p>${inc.message}</p>
                <div class="label">Related Actions</div>
                ${related.map(a => `
                  <div class="event" style="margin-top:5px">
                    <strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
                    <small>${a.description || a.action_type || 'No description'}</small>
                  </div>
                `).join('') || '<div class="value">No actions yet</div>'}
              </div>
            `;
          }).join('');
        }
      }
      if (currentView === 'settings') {
        const config = await fetchData('/config');
        const content = document.getElementById('settings-content');
        // fetchData returns null on failure; leave the panel unchanged in that case.
        if (config) {
          content.innerHTML = `
            <div class="label">Auto Mode</div>
            <div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
            <div class="label">Action Thresholds</div>
            <div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
            <div class="label">Telegram Integration</div>
            <div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
            <button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
          `;
        }
      }
    }

    async function copyForAI() {
      const btn = document.getElementById('copy-ai-btn');
      const original = btn.textContent;
      btn.textContent = 'Copying...';
      btn.disabled = true;

      try {
        const snap = await fetchData('/snapshot');
        if (!snap) throw new Error('snapshot fetch failed');

        const now = new Date(snap.timestamp);
        const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
        const lines = [];

        lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);

        if (snap.nodes && snap.nodes.length > 0) {
          lines.push('NODES: ' + snap.nodes.map(n =>
            `${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
          ).join(', '));
        } else {
          lines.push('NODES: none');
        }

        if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
          lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
            `${s.name} (${s.node}) - ${s.health}`
          ).join(', '));
        } else {
          lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
        }

        const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
        if (activeIncidents.length > 0) {
          lines.push('INCIDENTS: ' + activeIncidents.map(i =>
            `[${i.severity}] ${i.message} (${i.node})`
          ).join('; '));
        } else {
          lines.push('INCIDENTS: none');
        }

        if (snap.events && snap.events.length > 0) {
          lines.push(`EVENTS (last ${snap.events.length}):`);
          snap.events.forEach(ev => {
            const ts = ev.timestamp
              ? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
              : '?';
            const svc = ev.service ? '/' + ev.service : '';
            lines.push(` ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
          });
        } else {
          lines.push('EVENTS (last 10): none');
        }

        const s = snap.summary || {};
        lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);

        await navigator.clipboard.writeText(lines.join('\n'));
        btn.textContent = 'Copied!';
        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
      } catch (e) {
        console.error('copyForAI error:', e);
        btn.textContent = 'Error';
        setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
      }
    }

    // Initial load
    refreshData();
    // Poll for updates
    setInterval(refreshData, pollInterval);

  </script>
</body>
</html>
@ -1,301 +0,0 @@
import json
import os
import time
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))

STATIC_DIR = Path(__file__).parent

DEFAULT_CONFIG = {
    "operator_mode": "approval",
    "auto_mode": True,
    "action_thresholds": {
        "restart_ha": 0.8,
        "check_network": 0.9,
    },
    "default_threshold": 0.9,
    "allowed_auto_actions": ["restart_ha"],
}


def read_json_file(path, default=None):
    if not path.exists():
        return default if default is not None else []
    try:
        return json.loads(path.read_text())
    except Exception:
        return default if default is not None else []


def get_config():
    config_path = STATE_DIR / "operator-config.json"
    # Always hand back a copy: POST /config and POST /mode mutate the result,
    # and must never alter the module-level DEFAULT_CONFIG.
    if config_path.exists():
        return read_json_file(config_path, dict(DEFAULT_CONFIG))
    return dict(DEFAULT_CONFIG)


def save_config(config):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))


def current_nodes():
    return read_json_file(WORLD_DIR / "nodes.json")


def current_services():
    return read_json_file(WORLD_DIR / "services.json")


def current_deployments():
    return read_json_file(WORLD_DIR / "deployments.json")


def current_incidents():
    return read_json_file(WORLD_DIR / "incidents.json")


def current_recommendations():
    return read_json_file(WORLD_DIR / "recommendations.json")


def current_summary():
    path = WORLD_DIR / "runtime-summary.json"
    summary = read_json_file(path, default={})
    if summary:
        last_update_val = summary.get("last_update")
        if last_update_val:
            try:
                if isinstance(last_update_val, str):
                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
                else:
                    last_update = float(last_update_val)
            except Exception:
                last_update = os.path.getmtime(path)
        else:
            last_update = os.path.getmtime(path)
        summary["last_update"] = last_update
        summary["stale"] = (time.time() - last_update) > 60
    return summary


def current_events():
    return read_json_file(WORLD_DIR / "events.json", default=[])


def current_actions():
    actions = {}
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    for status in statuses:
        actions[status] = []
        status_dir = ACTIONS_DIR / status
        if status_dir.exists():
            for f in status_dir.glob("*.json"):
                data = read_json_file(f)
                if data:
                    # Injects some metadata for UI
                    data["id"] = data.get("action_id") or f.stem
                    data["status"] = status
                    actions[status].append(data)
    return actions


def mutate_action(action_id, target_status):
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"
    # The ID comes from the request body and is used to build file paths,
    # so reject anything that could point outside ACTIONS_DIR.
    if not isinstance(action_id, str) or "/" in action_id or "\\" in action_id:
        return False, f"Invalid action id: {action_id}"

    # Find where the action is
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break

    if not source_path:
        return False, f"Action {action_id} not found"

    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"

    try:
        data = json.loads(source_path.read_text())
        data["status"] = target_status
        data["updated_at"] = time.time()

        # Keep history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": time.time()
        })
        data["transition_history"] = history

        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)


def get_snapshot():
    nodes = current_nodes()
    services = current_services()
    incidents = current_incidents()
    events = current_events()
    summary = current_summary()

    non_nominal = [s for s in services if s.get("health") != "nominal"]
    nominal_count = len(services) - len(non_nominal)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "nodes": nodes,
        "non_nominal_services": non_nominal,
        "nominal_service_count": nominal_count,
        "total_service_count": len(services),
        "incidents": incidents,
        "events": events[:10],
    }


def send_json(status, payload, handler):
    body = (json.dumps(payload) + "\n").encode("utf-8")
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/config":
            send_json(200, get_config(), self)
            return

        if self.path == "/nodes":
            send_json(200, current_nodes(), self)
            return

        if self.path == "/services":
            send_json(200, current_services(), self)
            return

        if self.path == "/deployments":
            send_json(200, current_deployments(), self)
            return

        if self.path == "/incidents":
            send_json(200, current_incidents(), self)
            return

        if self.path == "/recommendations":
            send_json(200, current_recommendations(), self)
            return

        if self.path == "/summary":
            send_json(200, current_summary(), self)
            return

        if self.path == "/events":
            send_json(200, current_events(), self)
            return

        if self.path == "/actions":
            send_json(200, current_actions(), self)
            return

        if self.path == "/snapshot":
            send_json(200, get_snapshot(), self)
            return

        if self.path in ("/", "/index.html"):
            body = (STATIC_DIR / "index.html").read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return

        self.send_error(404)

    def do_POST(self):
        if self.path not in (
            "/config",
            "/action/mutate",
            "/mode",
        ):
            self.send_error(404)
            return

        length = int(self.headers.get("Content-Length", "0"))
        raw_body = self.rfile.read(length).decode("utf-8")
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError:
            self.send_error(400, "Invalid JSON")
            return
        # Every endpoint below reads keys from the payload, so it must be a JSON object.
        if not isinstance(payload, dict):
            self.send_error(400, "JSON object required")
            return

        if self.path == "/config":
            config = get_config()
            config.update(payload)
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/mode":
            mode = payload.get("mode")
            if not mode:
                self.send_error(400, "mode is required")
                return
            config = get_config()
            config["operator_mode"] = mode
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/action/mutate":
            action_id = payload.get("id")
            target = payload.get("status")
            if not action_id or not target:
                self.send_error(400, "id and status are required")
                return
            success, msg = mutate_action(action_id, target)
            if success:
                send_json(200, {"status": "ok"}, self)
            else:
                self.send_error(500, msg)
            return

    def log_message(self, format, *args):
        return


if __name__ == "__main__":
    # Ensure directories exist
    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
        d.mkdir(parents=True, exist_ok=True)
    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)

    port = int(os.getenv("PORT", "8080"))
    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    server.serve_forever()
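The POST handler above accepts `/action/mutate` with a JSON body carrying `id` and `status`. A minimal client-side sketch of that call follows, assuming the server is reachable at a base URL of your choosing; `build_mutate_request` is a hypothetical helper for illustration and is not part of the repository.

```python
import json
import urllib.request


def build_mutate_request(base_url, action_id, status):
    """Build (but do not send) a POST /action/mutate request.

    base_url, action_id and status are caller-supplied; the helper name
    is hypothetical and only mirrors the payload the handler reads.
    """
    body = json.dumps({"id": action_id, "status": status}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/action/mutate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately keeps the payload shape easy to inspect without a running server.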
@ -1,10 +0,0 @@
FROM python:3.11-slim

WORKDIR /app

COPY src/ src/

ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app/src

CMD ["python", "-m", "brain_watchdog.main"]
@ -1,30 +0,0 @@
services:
  brain-watchdog:
    build: .
    container_name: brain-watchdog
    restart: unless-stopped

    env_file:
      - /opt/homelab/config/brain-watchdog/.env

    volumes:
      - brain_watchdog_data:/data

    healthcheck:
      test:
        - "CMD"
        - "python"
        - "-c"
        - |
          import os, time, json, sys
          p = '/data/state.json'
          if not os.path.exists(p): sys.exit(1)
          age = time.time() - os.path.getmtime(p)
          sys.exit(0 if age < 300 else 1)
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  brain_watchdog_data:
@ -1,7 +0,0 @@
CONTROL_PLANE_URL=
STALE_THRESHOLD=600
INTERVAL=60
FAILS_BEFORE_ALERT=3
TG_TOKEN=
TG_CHAT_ID=
HEALTHCHECKS_URL=
@ -1,10 +0,0 @@
#!/bin/sh
# Healthy if state.json was written within the last 5 minutes.
python -c "
import os, time, sys
p = '/data/state.json'
if not os.path.exists(p):
    sys.exit(1)
age = time.time() - os.path.getmtime(p)
sys.exit(0 if age < 300 else 1)
"
@ -1,3 +0,0 @@
[pytest]
pythonpath = src
testpaths = tests
@ -1,34 +0,0 @@
service:
  name: brain-watchdog
  owner_node: piha
  exposure: private
  description: >
    External watchdog for the control-plane on VPS. Queries /summary over
    Tailscale and alerts via Telegram Bot API directly — no dependency on the
    control-plane itself. Freshness is computed locally from last_update epoch.

dependencies:
  - control-plane # external — on VPS; deliberately untrusted for liveness

healthcheck:
  type: docker
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 30s

restart_policy: unless-stopped

persistence:
  paths:
    - /data # state.json: fail_count, alerted, last_ok

runtime:
  env_vars:
    - CONTROL_PLANE_URL # Tailscale IP + port of operator-ui (required)
    - STALE_THRESHOLD # seconds before brain is considered stale (default: 600)
    - INTERVAL # poll interval seconds (default: 60)
    - FAILS_BEFORE_ALERT # consecutive failures before Telegram alert (default: 3)
    - TG_TOKEN # Telegram Bot API token (required)
    - TG_CHAT_ID # Telegram chat/user ID (required)
    - HEALTHCHECKS_URL # optional healthchecks.io ping URL
@ -1,157 +0,0 @@
"""
brain-watchdog: external watchdog for the control-plane on VPS.

Runs on PIHA; queries /summary directly over Tailscale and alerts via
Telegram Bot API without going through the control-plane itself.
Never trusts the self-reported "status" field — freshness is computed
locally from last_update epoch vs. time.time().
"""

import json
import os
import time
import urllib.error
import urllib.request
from pathlib import Path

CONTROL_PLANE_URL = os.environ["CONTROL_PLANE_URL"].rstrip("/")
STALE_THRESHOLD = int(os.environ.get("STALE_THRESHOLD", "600"))
INTERVAL = int(os.environ.get("INTERVAL", "60"))
FAILS_BEFORE_ALERT = int(os.environ.get("FAILS_BEFORE_ALERT", "3"))
TG_TOKEN = os.environ["TG_TOKEN"]
TG_CHAT_ID = os.environ["TG_CHAT_ID"]
HEALTHCHECKS_URL = os.environ.get("HEALTHCHECKS_URL", "").strip()

STATE_FILE = Path("/data/state.json")


def load_state() -> dict:
    if STATE_FILE.exists():
        try:
            return json.loads(STATE_FILE.read_text())
        except Exception:
            pass
    return {"fail_count": 0, "alerted": False, "last_ok": 0.0}


def save_state(state: dict) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))


def http_get(url: str, timeout: int = 10) -> tuple[int | None, dict | None]:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as exc:
        return exc.code, None
    except Exception:
        return None, None


def send_telegram(message: str) -> bool:
    url = f"https://api.telegram.org/bot{TG_TOKEN}/sendMessage"
    payload = json.dumps(
        {"chat_id": TG_CHAT_ID, "text": message, "parse_mode": "HTML"}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except Exception as exc:
        print(f"[telegram] send failed: {exc}", flush=True)
        return False


def ping_healthchecks() -> None:
    if not HEALTHCHECKS_URL:
        return
    try:
        urllib.request.urlopen(HEALTHCHECKS_URL, timeout=10)
    except Exception as exc:
        print(f"[healthchecks] ping failed: {exc}", flush=True)


def check() -> tuple[bool, str]:
    """Return (ok, human-readable reason). Never reads 'status' field."""
    status, body = http_get(f"{CONTROL_PLANE_URL}/summary")

    if status is None:
        return False, "panel unreachable (connection error)"

    if status != 200:
        return False, f"panel returned HTTP {status}"

    if not body:
        return False, "panel returned empty / invalid JSON"

    raw = body.get("last_update")
    if raw is None:
        return False, "summary missing last_update field"

    try:
        last_update_ts = float(raw)
    except (TypeError, ValueError):
        return False, f"last_update not parseable: {raw!r}"

    age = time.time() - last_update_ts
    if age > STALE_THRESHOLD:
        return False, (
            f"brain stale: last update {int(age // 60)}m ago "
            f"(threshold {STALE_THRESHOLD // 60}m)"
        )

    return True, f"ok (age {int(age)}s)"


def main() -> None:
    print(
        f"[brain-watchdog] starting — "
        f"url={CONTROL_PLANE_URL} "
        f"stale_threshold={STALE_THRESHOLD}s "
        f"interval={INTERVAL}s "
        f"fails_before_alert={FAILS_BEFORE_ALERT}",
        flush=True,
    )
    state = load_state()

    while True:
        ok, reason = check()
        ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"[{ts}] {'OK ' if ok else 'FAIL'} — {reason}", flush=True)

        if ok:
            if state["alerted"]:
                send_telegram(
                    "✅ <b>brain-watchdog: control-plane RECOVERED</b>\n"
                    f"{reason}"
                )
                print("[telegram] sent recovery alert", flush=True)
            state["fail_count"] = 0
            state["alerted"] = False
            state["last_ok"] = time.time()
            save_state(state)
            ping_healthchecks()
        else:
            state["fail_count"] = state.get("fail_count", 0) + 1
            save_state(state)

            if state["fail_count"] >= FAILS_BEFORE_ALERT and not state["alerted"]:
                sent = send_telegram(
                    "🚨 <b>brain-watchdog: control-plane DOWN</b>\n"
                    f"Reason: {reason}\n"
                    f"Consecutive failures: {state['fail_count']}\n"
                    f"URL: <code>{CONTROL_PLANE_URL}</code>"
                )
                if sent:
                    state["alerted"] = True
                    save_state(state)
                    print("[telegram] sent alert", flush=True)

        time.sleep(INTERVAL)


if __name__ == "__main__":
    main()
|
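The alerting behaviour of `main()` in the watchdog module above amounts to a small state machine: failures accumulate in `fail_count`, a single alert fires once the count reaches `FAILS_BEFORE_ALERT`, and one recovery message follows the next successful check. As a minimal sketch, the same transitions can be written as a pure function that is easy to unit-test. The function name `step` and the event strings are hypothetical and not part of the module, and the sketch assumes the Telegram send always succeeds, whereas `main()` only sets `alerted` when the send returns true.

```python
from typing import Optional, Tuple


def step(state: dict, ok: bool, fails_before_alert: int = 3) -> Tuple[dict, Optional[str]]:
    """Return (new_state, event); event is 'alert', 'recovered', or None."""
    new = dict(state)  # work on a copy so the caller's state is not mutated
    if ok:
        # A success after an alert produces exactly one recovery event.
        event = "recovered" if new.get("alerted") else None
        new["fail_count"] = 0
        new["alerted"] = False
        return new, event

    new["fail_count"] = new.get("fail_count", 0) + 1
    # Alert once when the threshold is reached; stay silent while already alerted.
    if new["fail_count"] >= fails_before_alert and not new.get("alerted"):
        new["alerted"] = True
        return new, "alert"
    return new, None
```

Feeding it three failures, a fourth failure, and then two successes should yield one `alert` on the third failure and one `recovered` on the first success.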
@ -1,66 +0,0 @@
"""
Tests for brain_watchdog.main.

Module-level env vars are required at import time; set them before the first
import of the module so tests can run without a real control-plane.
"""
import importlib.util
import os
import time
from unittest.mock import patch

os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
os.environ.setdefault("TG_TOKEN", "test_token")
os.environ.setdefault("TG_CHAT_ID", "12345")

import brain_watchdog.main as bwm


def test_package_importable():
    spec = importlib.util.find_spec("brain_watchdog")
    assert spec is not None


def test_check_ok_fresh():
    now = time.time()
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
        ok, reason = bwm.check()
    assert ok
    assert "ok" in reason


def test_check_fail_stale():
    now = time.time()
    stale_ts = now - (bwm.STALE_THRESHOLD + 120)
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
        ok, reason = bwm.check()
    assert not ok
    assert "stale" in reason


def test_check_fail_unreachable():
    with patch.object(bwm, "http_get", return_value=(None, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "unreachable" in reason


def test_check_fail_http_error():
    with patch.object(bwm, "http_get", return_value=(503, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "503" in reason


def test_check_fail_missing_last_update():
    with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
        ok, reason = bwm.check()
    assert not ok
    assert "last_update" in reason


def test_check_fail_unparseable_timestamp():
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
        ok, reason = bwm.check()
    assert not ok
    assert "parseable" in reason
@ -1,24 +0,0 @@
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir pyyaml

# Create homelab user
RUN useradd -m -u 1000 homelab

# Copy sources
COPY src/ /app/src/

# The observer script and the rest of scripts/ are not copied into the image.
# The repo is assumed to be mounted at /repo (see docker-compose.yml) so the
# observer, supervisor, and executor can find them.
ENV REPO_ROOT=/repo
ENV RUNTIME_PATH=/opt/homelab
ENV PYTHONUNBUFFERED=1

# Default command (will be overridden in docker-compose)
USER homelab
CMD ["python", "src/operator_ui.py"]
@ -1,73 +0,0 @@
#!/bin/bash
# services/control-plane/deploy-local.sh
set -e

# 1. Validate it is deploying control-plane
if [[ ! $(pwd) == *"/services/control-plane" ]]; then
    echo "Error: Script must be run from services/control-plane directory"
    exit 1
fi

if [[ ! -f "docker-compose.yml" ]]; then
    echo "Error: docker-compose.yml not found"
    exit 1
fi

echo "--- Preparing Control Plane Directories ---"
# 2. Prepare required dirs
# /opt/homelab/config
# /opt/homelab/actions/{pending,approved,rejected,running,completed,failed}
# /opt/homelab/world
# /opt/homelab/state

DIRS=(
    "/opt/homelab/config"
    "/opt/homelab/actions/pending"
    "/opt/homelab/actions/approved"
    "/opt/homelab/actions/rejected"
    "/opt/homelab/actions/running"
    "/opt/homelab/actions/completed"
    "/opt/homelab/actions/failed"
    "/opt/homelab/world"
    "/opt/homelab/state"
)

for dir in "${DIRS[@]}"; do
    if [ ! -d "$dir" ]; then
        echo "Creating $dir"
        sudo mkdir -p "$dir"
    fi
done

# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
echo "Checking /opt/homelab ownership..."
_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
if [[ -n "$_chown_needed" ]]; then
    echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
    sudo chown -R 1000:1000 /opt/homelab
else
    echo "Ownership already correct, skipping chown"
fi

echo "Checking /opt/homelab directory permissions..."
_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
if [[ -n "$_chmod_needed" ]]; then
    echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
    sudo chmod -R 775 /opt/homelab 2>/dev/null || true
else
    echo "Permissions already correct, skipping chmod"
fi

# 4. Run docker compose up -d --build --force-recreate
echo "--- Starting Control Plane Services ---"
COMPOSE_ARGS="-f docker-compose.yml"
OVERRIDE_FILE="../../hosts/vps/runtime/control-plane/docker-compose.override.yml"
if [ -f "$OVERRIDE_FILE" ]; then
    echo "Using override: $OVERRIDE_FILE"
    COMPOSE_ARGS="$COMPOSE_ARGS -f $OVERRIDE_FILE"
fi
docker compose $COMPOSE_ARGS up -d --build --force-recreate

# 5. Print docker ps for control-plane containers
echo "--- Deployment Status ---"
docker ps --filter "name=control-plane"
@ -1,76 +0,0 @@
services:
  operator-ui:
    build: .
    container_name: control-plane-ui
    user: "1000:1000"
    command: python src/operator_ui.py
    ports:
      - "18180:8080"
    volumes:
      - /opt/homelab:/opt/homelab
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/', timeout=3).read()"]
      interval: 30s
      timeout: 10s
      retries: 3

  observer:
    build: .
    container_name: control-plane-observer
    user: "1000:1000"
    command: python /repo/scripts/observer/observer.py
    volumes:
      - /opt/homelab:/opt/homelab
      - ../..:/repo:ro
    restart: unless-stopped
    environment:
      - REPO_ROOT=/repo
      - RUNTIME_PATH=/opt/homelab
    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/observer.heartbeat"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s

  supervisor:
    build: .
    container_name: control-plane-supervisor
    user: "1000:1000"
    command: python src/supervisor.py
    volumes:
      - /opt/homelab:/opt/homelab
      - ../..:/repo:ro
    restart: unless-stopped
    environment:
      - REPO_ROOT=/repo
      - RUNTIME_PATH=/opt/homelab
    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/supervisor.heartbeat"]
      interval: 60s
      timeout: 5s
      retries: 3
      start_period: 10s

  executor:
    build: .
    container_name: control-plane-executor
    user: "1000:1000"
    group_add:
      - "999"
    command: python src/executor.py
    volumes:
      - /opt/homelab:/opt/homelab
      - ../..:/repo
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped
    environment:
      - REPO_ROOT=/repo
      - RUNTIME_PATH=/opt/homelab
    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/executor.heartbeat"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s
@ -1,19 +0,0 @@
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "control-plane"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pyyaml>=6.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.1",
]

[tool.pytest.ini_options]
testpaths = ["tests"]
@ -1,246 +0,0 @@
import os
import json
import time
import logging
import subprocess
from pathlib import Path


def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))

# SSH configuration
# SSH_USER can be overridden per-deployment environment.
SSH_USER = os.getenv("SSH_USER", "oskar")
SSH_OPTIONS = [
    "-o", "StrictHostKeyChecking=no",
    "-o", "ConnectTimeout=10",
    "-o", "BatchMode=yes",
]

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("executor")


class Executor:
    def __init__(self):
        self._ensure_dirs()

    def _ensure_dirs(self):
        for s in ["approved", "running", "completed", "failed", "rejected"]:
            (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)

    def process_actions(self):
        # Update heartbeat
        heartbeat_file = ACTIONS_DIR.parent / "state" / "executor.heartbeat"
        try:
            heartbeat_file.touch()
        except Exception as e:
            logger.error(f"Failed to touch heartbeat file: {e}")

        approved_dir = ACTIONS_DIR / "approved"
        action_files = sorted(approved_dir.glob("*.json"))

        for action_file in action_files:
            self._execute_action(action_file)

    def _execute_action(self, action_file):
        action_id = action_file.stem
        logger.info(f"Executing action: {action_id}")

        # Move to running
        running_path = ACTIONS_DIR / "running" / f"{action_id}.json"
        try:
            with open(action_file, "r") as f:
                data = json.load(f)
            data["status"] = "running"
            data["started_at"] = time.time()
            _atomic_write_json(running_path, data)
            action_file.unlink()
        except Exception as e:
            logger.error(f"Failed to move {action_id} to running: {e}")
            return

        # Dispatch by action type
        success = False
        error_msg = ""
        try:
            action_type = data.get("type")
            node = data.get("node")
            service = data.get("service")

            if action_type == "redeploy":
                # Full service redeploy via the repo deploy script
                cmd = [
                    str(REPO_ROOT / "scripts" / "deploy" / "deploy-node.sh"),
                    node,
                    service
                ]
                logger.info(f"Running command: {' '.join(cmd)}")
                result = subprocess.run(cmd, capture_output=True, text=True, cwd=str(REPO_ROOT))
                if result.returncode == 0:
                    success = True
                else:
                    success = False
                    error_msg = result.stderr or result.stdout

            elif action_type == "container_restart":
                # Lightweight restart: SSH to node and docker restart the container.
                # container_name is set by the supervisor; falls back to service name.
                container_name = data.get("container_name") or service
                success, error_msg = self._execute_container_restart(node, container_name)

            elif action_type == "disk_cleanup":
                # Operator-approved aggressive Docker cleanup (image prune -a +
                # volume prune). Commands come from the action payload so the
                # supervisor controls exactly what runs; the executor adds a
                # safety check to reject anything touching protected paths.
                payload = data.get("payload", {})
                success, error_msg = self._execute_disk_cleanup(node, payload)

            elif action_type == "alert_only":
                # Operator acknowledged the alert; no automated execution needed.
                success = True

            else:
                success = False
                error_msg = f"Unknown action type: {action_type}"

        except Exception as e:
            success = False
            error_msg = str(e)

        # Move to completed/failed
        target_status = "completed" if success else "failed"
        target_path = ACTIONS_DIR / target_status / f"{action_id}.json"
        try:
            data["status"] = target_status
            data["finished_at"] = time.time()
            if not success:
                data["error"] = error_msg
            _atomic_write_json(target_path, data)
            running_path.unlink()
            logger.info(f"Action {action_id} {target_status}")
        except Exception as e:
            logger.error(f"Failed to move {action_id} to {target_status}: {e}")

    def _execute_container_restart(self, node, container_name, retry_delay=10):
        """
        SSH to the target node and run `docker restart <container_name>`.

        Attempts the restart up to 2 times (initial + 1 retry). If the first
        attempt fails, waits retry_delay seconds then tries once more before
        declaring the action failed.

        Returns (success: bool, error_msg: str).
        """
        cmd = [
            "ssh",
            *SSH_OPTIONS,
            f"{SSH_USER}@{node}",
            f"docker restart {container_name}",
        ]
        logger.info(f"SSH container restart: {' '.join(cmd)}")

        max_attempts = 2
        last_error = ""

        for attempt in range(1, max_attempts + 1):
            result = subprocess.run(cmd, capture_output=True, text=True)

            if result.returncode == 0:
                logger.info(
                    f"Container '{container_name}' on {node} restarted successfully "
                    f"(attempt {attempt}/{max_attempts})"
                )
                return True, ""

            last_error = (result.stderr or result.stdout).strip()
            logger.warning(
                f"container_restart attempt {attempt}/{max_attempts} failed "
                f"for '{container_name}' on {node}: {last_error}"
            )

            if attempt < max_attempts:
                logger.info(f"Retrying in {retry_delay}s...")
                time.sleep(retry_delay)

        logger.error(
            f"container_restart exhausted all {max_attempts} attempts "
            f"for '{container_name}' on {node}"
        )
        return False, last_error

    def _execute_disk_cleanup(self, node: str, payload: dict):
        """
        SSH to the target node and run the operator-approved disk cleanup
        commands from the action payload.

        Safety invariants enforced here regardless of payload content:
          - No command may reference /opt/homelab/data/, /opt/homelab/config/,
            or /opt/homelab/state/ (application data and configuration).
          - No command may contain rm -rf / or similar destructive patterns.
        If any command fails the safety check the entire action is rejected
        (not run at all) and the rejection reason is recorded.

        Returns (success: bool, error_msg: str).
        """
        commands = payload.get("commands", [
            "docker image prune -a -f",
            "docker volume prune -f",
        ])

        # Safety gate: reject commands that touch protected paths
        FORBIDDEN = [
            "/opt/homelab/data",
            "/opt/homelab/config",
            "/opt/homelab/state",
            "rm -rf /",
        ]
        for cmd in commands:
            for forbidden in FORBIDDEN:
                if forbidden in cmd:
                    msg = f"Rejected: command contains forbidden pattern '{forbidden}': {cmd}"
                    logger.error(msg)
                    return False, msg

        full_command = " && ".join(commands)
        cmd = [
            "ssh",
            *SSH_OPTIONS,
            f"{SSH_USER}@{node}",
            full_command,
        ]
        logger.info(f"Disk cleanup on {node}: {full_command}")

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            logger.info(f"Disk cleanup on {node} succeeded")
            return True, ""

        error_msg = (result.stderr or result.stdout).strip()
        logger.error(f"Disk cleanup on {node} failed: {error_msg}")
        return False, error_msg

    def loop(self, interval=10):
        logger.info("Starting executor loop")
        while True:
            self.process_actions()
            time.sleep(interval)


if __name__ == "__main__":
    executor = Executor()
    executor.loop()
|
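`_execute_container_restart` in the executor above interpolates `container_name` directly into the string that `ssh` hands to the remote shell, so a name containing shell metacharacters would be interpreted on the target node. One possible hardening, shown here as a minimal sketch and assuming the remote login shell is POSIX-compatible, is to quote the value with the standard library's `shlex.quote` before building the command. The helper name `build_restart_command` is hypothetical and not part of the executor; `SSH_OPTIONS` is copied from the module for self-containment.

```python
import shlex

# Copied from the executor module so the sketch runs on its own.
SSH_OPTIONS = [
    "-o", "StrictHostKeyChecking=no",
    "-o", "ConnectTimeout=10",
    "-o", "BatchMode=yes",
]


def build_restart_command(user: str, node: str, container_name: str) -> list[str]:
    """Build the ssh argv for a remote `docker restart`, quoting the container name."""
    # shlex.quote leaves plain names untouched and single-quotes anything
    # containing characters a POSIX shell would otherwise interpret.
    remote = f"docker restart {shlex.quote(container_name)}"
    return ["ssh", *SSH_OPTIONS, f"{user}@{node}", remote]
```

With this helper a plain name such as `redis` passes through unchanged, while a value containing `;` or spaces reaches the remote shell as a single quoted argument.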
@ -1,701 +0,0 @@
|
||||||
<!doctype html>
|
|
||||||
<html lang="en">
|
|
||||||
<head>
|
|
||||||
<meta charset="utf-8">
|
|
||||||
<meta name="viewport" content="width=device-width, initial-scale=1">
|
|
||||||
<title>Operator Control Plane</title>
|
|
||||||
<style>
|
|
||||||
:root {
|
|
||||||
--bg-color: #0a0c0e;
|
|
||||||
--sidebar-color: #14171a;
|
|
||||||
--card-color: #1c2024;
|
|
||||||
--border-color: #2a3540;
|
|
||||||
--text-color: #e7edf3;
|
|
||||||
--text-muted: #94a3b8;
|
|
||||||
--accent-color: #3eaf7c;
|
|
||||||
--nominal: #3eaf7c;
|
|
||||||
--degraded: #e7c000;
|
|
||||||
--unstable: #e67e22;
|
|
||||||
--reconciling: #3498db;
|
|
||||||
--error: #c0392b;
|
|
||||||
--safe: #3eaf7c;
|
|
||||||
--guarded: #e67e22;
|
|
||||||
--dangerous: #c0392b;
|
|
||||||
}
|
|
||||||
|
|
||||||
body {
|
|
||||||
margin: 0;
|
|
||||||
font-family: 'Inter', system-ui, -apple-system, sans-serif;
|
|
||||||
background: var(--bg-color);
|
|
||||||
color: var(--text-color);
|
|
||||||
display: flex;
|
|
||||||
height: 100vh;
|
|
||||||
overflow: hidden;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Sidebar */
|
|
||||||
.sidebar {
|
|
||||||
width: 240px;
|
|
||||||
background: var(--sidebar-color);
|
|
||||||
border-right: 1px solid var(--border-color);
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
flex-shrink: 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
.sidebar-header {
|
|
||||||
padding: 24px;
|
|
||||||
font-weight: 800;
|
|
||||||
font-size: 14px;
|
|
||||||
letter-spacing: 0.1em;
|
|
||||||
color: var(--accent-color);
|
|
||||||
border-bottom: 1px solid var(--border-color);
|
|
||||||
}
|
|
||||||
|
|
||||||
.nav-list {
|
|
||||||
list-style: none;
|
|
||||||
padding: 12px 0;
|
|
||||||
margin: 0;
|
|
||||||
flex-grow: 1;
|
|
||||||
}
|
|
||||||
|
|
||||||
.nav-item {
|
|
||||||
padding: 12px 24px;
|
|
||||||
cursor: pointer;
|
|
||||||
font-size: 14px;
|
|
||||||
color: var(--text-muted);
|
|
||||||
transition: all 0.2s;
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
gap: 12px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.nav-item:hover {
|
|
||||||
background: rgba(255, 255, 255, 0.05);
|
|
||||||
color: var(--text-color);
|
|
||||||
}
|
|
||||||
|
|
||||||
.nav-item.active {
|
|
||||||
background: rgba(62, 175, 124, 0.1);
|
|
||||||
color: var(--accent-color);
|
|
||||||
border-left: 3px solid var(--accent-color);
|
|
||||||
}
|
|
||||||
|
|
||||||
.sidebar-footer {
|
|
||||||
padding: 16px;
|
|
||||||
border-top: 1px solid var(--border-color);
|
|
||||||
font-size: 12px;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Content Area */
|
|
||||||
.main-content {
|
|
||||||
flex-grow: 1;
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
overflow: hidden;
|
|
||||||
}
|
|
||||||
|
|
||||||
header {
|
|
||||||
height: 64px;
|
|
||||||
border-bottom: 1px solid var(--border-color);
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
padding: 0 24px;
|
|
||||||
justify-content: space-between;
|
|
||||||
background: var(--bg-color);
|
|
||||||
}
|
|
||||||
|
|
||||||
.view-title {
|
|
||||||
font-size: 18px;
|
|
||||||
font-weight: 600;
|
|
||||||
}
|
|
||||||
|
|
||||||
.content-scroll {
|
|
||||||
flex-grow: 1;
|
|
||||||
overflow-y: auto;
|
|
||||||
padding: 24px;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Cards & Grids */
|
|
||||||
.grid {
|
|
||||||
display: grid;
|
|
||||||
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
|
|
||||||
gap: 20px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.card {
|
|
||||||
background: var(--card-color);
|
|
||||||
border: 1px solid var(--border-color);
|
|
||||||
padding: 20px;
|
|
||||||
border-radius: 4px;
|
|
||||||
position: relative;
|
|
||||||
}
|
|
||||||
|
|
||||||
.card-header {
|
|
||||||
display: flex;
|
|
||||||
justify-content: space-between;
|
|
||||||
align-items: center;
|
|
||||||
margin-bottom: 16px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.card-title {
|
|
||||||
font-weight: 700;
|
|
||||||
font-size: 16px;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Status Badges */
|
|
||||||
.badge {
|
|
||||||
padding: 4px 8px;
|
|
||||||
border-radius: 4px;
|
|
||||||
font-size: 11px;
|
|
||||||
font-weight: 700;
|
|
||||||
    text-transform: uppercase;
}

.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }

/* Timeline */
.timeline {
    display: flex;
    flex-direction: column;
    gap: 12px;
}

.event {
    padding: 12px;
    border-left: 2px solid var(--border-color);
    background: rgba(255, 255, 255, 0.02);
    font-family: ui-monospace, monospace;
    font-size: 13px;
}

.event.high { border-left-color: var(--error); }
.event.medium { border-left-color: var(--unstable); }
.event.low { border-left-color: var(--nominal); }

.event-header {
    display: flex;
    justify-content: space-between;
    margin-bottom: 4px;
    color: var(--text-muted);
}

/* Forms & Inputs */
.controls {
    display: flex;
    gap: 12px;
    margin-top: 20px;
}

input, button {
    background: var(--card-color);
    border: 1px solid var(--border-color);
    color: var(--text-color);
    padding: 8px 16px;
    font-size: 14px;
    border-radius: 4px;
}

button {
    cursor: pointer;
    font-weight: 600;
}

button:hover { background: var(--border-color); }

.btn-primary { background: var(--accent-color); color: white; border: none; }
.btn-primary:hover { background: #359b6d; }

/* Utility */
.hidden { display: none !important; }
.mono { font-family: ui-monospace, monospace; }
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
.value { font-weight: 500; margin-bottom: 12px; }

.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }

</style>
</head>
<body>
<aside class="sidebar">
    <div class="sidebar-header">HOMELAB OPERATOR</div>
    <ul class="nav-list">
        <li class="nav-item active" onclick="showView('dashboard', this)">
            <span>Dashboard</span>
        </li>
        <li class="nav-item" onclick="showView('actions', this)">
            <span>Action Queue</span>
        </li>
        <li class="nav-item" onclick="showView('nodes', this)">
            <span>Nodes</span>
        </li>
        <li class="nav-item" onclick="showView('services', this)">
            <span>Services</span>
        </li>
        <li class="nav-item" onclick="showView('deployments', this)">
            <span>Deployments</span>
        </li>
        <li class="nav-item" onclick="showView('topology', this)">
            <span>Topology</span>
        </li>
        <li class="nav-item" onclick="showView('events', this)">
            <span>Events</span>
        </li>
        <li class="nav-item" onclick="showView('correlation', this)">
            <span>Correlation</span>
        </li>
        <li class="nav-item" onclick="showView('recommendations', this)">
            <span>Recommendations</span>
        </li>
        <li class="nav-item" onclick="showView('settings', this)">
            <span>Settings</span>
        </li>
    </ul>
    <div class="sidebar-footer">
        <div id="summary-status">System Status: Loading...</div>
    </div>
</aside>

<main class="main-content">
    <div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
        RUNTIME STATE IS STALE
    </div>
    <header>
        <div style="display:flex; align-items:center; gap:20px">
            <div class="view-title" id="current-view-title">Dashboard</div>
            <select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
                <option value="observe">OBSERVE</option>
                <option value="recommend">RECOMMEND</option>
                <option value="approval" selected>APPROVAL</option>
                <option value="autonomous">AUTONOMOUS</option>
                <option value="maintenance">MAINTENANCE</option>
            </select>
        </div>
        <div class="header-actions">
            <button onclick="refreshData()">Refresh</button>
        </div>
    </header>

    <div class="content-scroll">
        <!-- Dashboard View -->
        <div id="view-dashboard" class="view">
            <div class="grid">
                <div class="card">
                    <div class="card-title">System Overview</div>
                    <div id="dashboard-summary" style="margin-top:20px"></div>
                </div>
                <div class="card">
                    <div class="card-title">Pending Actions</div>
                    <div id="dashboard-actions-summary" style="margin-top:20px"></div>
                </div>
                <div class="card">
                    <div class="card-title">Active Incidents</div>
                    <div id="dashboard-incidents" style="margin-top:20px"></div>
                </div>
            </div>
        </div>

        <!-- Actions View -->
        <div id="view-actions" class="view hidden">
            <div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
                <div>
                    <h3>Pending Approval</h3>
                    <div id="actions-pending" class="timeline"></div>
                </div>
                <div>
                    <h3>Active / History</h3>
                    <div id="actions-history" class="timeline"></div>
                </div>
            </div>
        </div>

        <!-- Nodes View -->
        <div id="view-nodes" class="view hidden">
            <div class="grid" id="nodes-list"></div>
        </div>

        <!-- Services View -->
        <div id="view-services" class="view hidden">
            <div class="grid" id="services-list"></div>
        </div>

        <!-- Deployments View -->
        <div id="view-deployments" class="view hidden">
            <div class="grid" id="deployments-list"></div>
        </div>

        <!-- Topology View -->
        <div id="view-topology" class="view hidden">
            <div class="card" style="min-height:500px">
                <div class="card-title">Runtime Topology</div>
                <div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
            </div>
        </div>

        <!-- Events View -->
        <div id="view-events" class="view hidden">
            <div class="timeline" id="events-timeline"></div>
        </div>

        <!-- Correlation View -->
        <div id="view-correlation" class="view hidden">
            <div id="correlation-chains" class="grid"></div>
        </div>

        <!-- Recommendations View -->
        <div id="view-recommendations" class="view hidden">
            <div class="grid" id="recommendations-list"></div>
        </div>

        <!-- Settings View -->
        <div id="view-settings" class="view hidden">
            <div class="card">
                <div class="card-title">Configuration</div>
                <div id="settings-content" style="margin-top:20px"></div>
            </div>
        </div>
    </div>
</main>

<script>
let currentView = 'dashboard';
const pollInterval = 5000;

function showView(viewId, el) {
    document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
    document.getElementById('view-' + viewId).classList.remove('hidden');
    document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
    if (el) el.classList.add('active');
    currentView = viewId;
    document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
    refreshData();
}

async function fetchData(endpoint) {
    try {
        const res = await fetch(endpoint, {cache: 'no-store'});
        // fetch() also resolves on HTTP error statuses, so treat non-2xx responses as failures.
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.json();
    } catch (e) {
        console.error('Fetch error:', endpoint, e);
        return null;
    }
}

async function postData(endpoint, data) {
    try {
        const res = await fetch(endpoint, {
            method: 'POST',
            headers: {'Content-Type': 'application/json'},
            body: JSON.stringify(data)
        });
        // fetch() also resolves on HTTP error statuses, so treat non-2xx responses as failures.
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return await res.json();
    } catch (e) {
        console.error('Post error:', endpoint, e);
        return null;
    }
}

async function mutateAction(id, status) {
    const res = await postData('/action/mutate', {id, status});
    if (res && res.status === 'ok') {
        refreshData();
    } else {
        alert('Mutation failed');
    }
}

async function setOperatorMode(mode) {
    // Log the request first; the mode is only confirmed once the backend answers 'ok'.
    console.log('Setting operator mode to:', mode);
    const res = await postData('/mode', {mode});
    if (res && res.status === 'ok') {
        console.log('Mode updated successfully');
    } else {
        console.error('Mode update failed:', mode);
    }
}

function formatTime(ts) {
    if (!ts) return 'N/A';
    return new Date(ts * 1000).toLocaleString();
}

function getStatusClass(status) {
    status = (status || '').toLowerCase();
    if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
    if (['degraded', 'warning'].includes(status)) return 'status-degraded';
    if (['unstable'].includes(status)) return 'status-unstable';
    if (['reconciling'].includes(status)) return 'status-reconciling';
    if (['error', 'down', 'failed'].includes(status)) return 'status-error';
    return '';
}

async function refreshData() {
    // Refresh summary always
    const summary = await fetchData('/summary');
    if (summary) {
        const statusEl = document.getElementById('summary-status');
        // An empty summary object has no `status`, so fall back instead of throwing.
        const systemStatus = summary.status || 'unknown';
        statusEl.textContent = `System Status: ${systemStatus.toUpperCase()}`;
        statusEl.className = 'sidebar-footer ' + getStatusClass(systemStatus);

        // Handle stale state
        const staleBanner = document.getElementById('stale-banner');
        if (summary.stale) {
            staleBanner.classList.remove('hidden');
            staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
        } else {
            staleBanner.classList.add('hidden');
        }

        if (currentView === 'dashboard') {
            const dashSummary = document.getElementById('dashboard-summary');
            dashSummary.innerHTML = `
                <div class="label">Nodes</div><div class="value">${summary.node_count}</div>
                <div class="label">Services</div><div class="value">${summary.service_count}</div>
                <div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
            `;
        }
    }

    if (currentView === 'dashboard' || currentView === 'actions') {
        const actions = await fetchData('/actions');
        if (actions) {
            if (currentView === 'dashboard') {
                const dashActions = document.getElementById('dashboard-actions-summary');
                const pendingCount = actions.pending.length;
                dashActions.innerHTML = `
                    <div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
                    <div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
                `;
            }
            if (currentView === 'actions') {
                const pendingEl = document.getElementById('actions-pending');
                const historyEl = document.getElementById('actions-history');

                pendingEl.innerHTML = actions.pending.map(a => `
                    <div class="card" style="margin-bottom:12px">
                        <div class="card-header">
                            <div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
                            <span class="badge risk-${a.risk_level}">${a.risk_level}</span>
                        </div>
                        <p>${a.description || a.action_type || 'No description'}</p>
                        <div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
                        <div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
                        <div class="controls">
                            <button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
                            <button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
                        </div>
                    </div>
                `).join('') || 'No pending actions.';

                const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
                historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
                    <div class="event">
                        <div class="event-header">
                            <span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
                            <span class="badge ${getStatusClass(a.status)}">${a.status}</span>
                        </div>
                        <div>${a.description || a.action_type || 'No description'}</div>
                        <small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
                        ${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
                        ${a.transition_history ? `
                            <div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
                                <strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
                            </div>
                        ` : ''}
                    </div>
                `).join('') || 'No history.';
            }
        }
    }

    if (currentView === 'dashboard' || currentView === 'events') {
        const incidents = await fetchData('/incidents');
        if (currentView === 'dashboard') {
            const dashIncidents = document.getElementById('dashboard-incidents');
            if (!incidents || incidents.length === 0) {
                dashIncidents.textContent = 'No active incidents.';
            } else {
                dashIncidents.innerHTML = incidents.map(inc => `
                    <div class="event ${inc.severity}">
                        <strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
                        <small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
                    </div>
                `).join('');
            }
        }
    }

    if (currentView === 'nodes') {
        // fetchData() returns null on failure; fall back to an empty list so .map() cannot throw.
        const nodes = (await fetchData('/nodes')) || [];
        const list = document.getElementById('nodes-list');
        list.innerHTML = nodes.map(node => `
            <div class="card">
                <div class="card-header">
                    <div class="card-title">${node.hostname}</div>
                    <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
                </div>
                <div class="label">ID</div><div class="value mono">${node.id}</div>
                <div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
                <div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
                <div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
                <div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
                <div class="label">Runtime Status</div><div class="value">${node.status}</div>
            </div>
        `).join('');
    }

    if (currentView === 'services') {
        // fetchData() returns null on failure; fall back to an empty list so .map() cannot throw.
        const services = (await fetchData('/services')) || [];
        const list = document.getElementById('services-list');
        list.innerHTML = services.map(svc => `
            <div class="card">
                <div class="card-header">
                    <div class="card-title">${svc.name}</div>
                    <span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
                </div>
                <div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
                <div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
                <div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
                <div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
            </div>
        `).join('');
    }

    if (currentView === 'deployments') {
        // fetchData() returns null on failure; fall back to an empty list so .map() cannot throw.
        const deps = (await fetchData('/deployments')) || [];
        const list = document.getElementById('deployments-list');
        list.innerHTML = deps.map(dep => `
            <div class="card">
                <div class="card-header">
                    <div class="card-title">${dep.service}</div>
                    <span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
                </div>
                <div class="label">ID</div><div class="value mono">${dep.id}</div>
                <div class="label">Stage</div><div class="value">${dep.stage}</div>
                <div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
                <div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
                ${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
            </div>
        `).join('');
    }

    if (currentView === 'events') {
        // fetchData() returns null on failure; fall back to an empty list so .map() cannot throw.
        const events = (await fetchData('/events')) || [];
        const timeline = document.getElementById('events-timeline');
        timeline.innerHTML = events.map(ev => `
            <div class="event ${ev.severity}">
                <div class="event-header">
                    <span>${ev.type.toUpperCase()}</span>
                    <span>${formatTime(ev.timestamp)}</span>
                </div>
                <div>${ev.message}</div>
                <div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
            </div>
        `).join('');
    }

    if (currentView === 'recommendations') {
        // fetchData() returns null on failure; fall back to an empty list so .map() cannot throw.
        const recs = (await fetchData('/recommendations')) || [];
        const list = document.getElementById('recommendations-list');
        list.innerHTML = recs.map(rec => `
            <div class="card">
                <div class="card-header">
                    <div class="card-title">${rec.title}</div>
                    <span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
                </div>
                <p>${rec.description}</p>
                <div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
                <div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
                <div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
                <div class="controls">
                    <button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
                </div>
            </div>
        `).join('');
    }

    if (currentView === 'topology') {
        const nodes = await fetchData('/nodes');
        const services = await fetchData('/services');
        const topMap = document.getElementById('topology-map');
        if (nodes && services) {
            topMap.innerHTML = nodes.map(node => {
                const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
                return `
                    <div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
                        <div class="card-header">
                            <div class="card-title">${node.hostname}</div>
                            <span class="badge ${getStatusClass(node.health)}">${node.health}</span>
                        </div>
                        <div class="label">Capabilities</div>
                        <div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
                        <div class="label">Services</div>
                        <div style="font-size:12px; margin-bottom:10px">
                            ${nodeServices.length > 0 ? nodeServices.map(s => `
                                <div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
                                    <span>${s.name}</span>
                                    <span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
                                </div>
                                ${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
                            `).join('') : '<div class="value">None</div>'}
                        </div>
                    </div>
                `;
            }).join('');
        }
    }

    if (currentView === 'correlation') {
        const incidents = await fetchData('/incidents');
        const actions = await fetchData('/actions');
        const list = document.getElementById('correlation-chains');
        if (incidents && actions) {
            const allActions = Object.values(actions).flat();
            list.innerHTML = incidents.map(inc => {
                const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
                return `
                    <div class="card">
                        <div class="card-header">
                            <div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
                            <span class="badge status-error">Active</span>
                        </div>
                        <p>${inc.message}</p>
                        <div class="label">Related Actions</div>
                        ${related.map(a => `
                            <div class="event" style="margin-top:5px">
                                <strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
                                <small>${a.description || a.action_type || 'No description'}</small>
                            </div>
                        `).join('') || '<div class="value">No actions yet</div>'}
                    </div>
                `;
            }).join('');
        }
    }
    if (currentView === 'settings') {
        const config = await fetchData('/config');
        const content = document.getElementById('settings-content');
        if (!config) {
            // fetchData() returns null on failure; show a notice instead of dereferencing null.
            content.textContent = 'Configuration unavailable.';
        } else {
            content.innerHTML = `
                <div class="label">Auto Mode</div>
                <div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
                <div class="label">Action Thresholds</div>
                <div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
                <div class="label">Telegram Integration</div>
                <div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
                <button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
            `;
        }
    }
}

// Initial load
refreshData();
// Poll for updates
setInterval(refreshData, pollInterval);

</script>
</body>
</html>
@ -1,426 +0,0 @@
import heapq
import json
import os
import re
import time
from datetime import datetime
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))

STATIC_DIR = Path(__file__).parent

_EVENT_TS_RE = re.compile(r"-(\d{9,11})-")

DEFAULT_CONFIG = {
    "operator_mode": "approval",
    "auto_mode": True,
    "action_thresholds": {
        "restart_ha": 0.8,
        "check_network": 0.9,
    },
    "default_threshold": 0.9,
    "allowed_auto_actions": ["restart_ha"],
}


def read_json_file(path, default=None):
    """Load JSON from path; return default (or [] when default is None) if the file is missing or unreadable."""
    if not path.exists():
        return default if default is not None else []
    try:
        return json.loads(path.read_text())
    except Exception:
        return default if default is not None else []


def get_config():
    config_path = STATE_DIR / "operator-config.json"
    if config_path.exists():
        return read_json_file(config_path, DEFAULT_CONFIG)
    return DEFAULT_CONFIG


def save_config(config):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))


EVENTS_MAX_AGE_HOURS = int(os.getenv("EVENTS_MAX_AGE_HOURS", "24"))
EVENTS_MAX_COUNT = int(os.getenv("EVENTS_MAX_COUNT", "200"))


def _node_health(info):
    status = info.get("status", "unknown")
    if status == "offline":
        return "error"
    if info.get("disk_pressure") == "high":
        return "degraded"
    if status == "online":
        return "nominal"
    return status


def current_nodes():
    """Return nodes as a list of dicts shaped for the UI.

    The observer stores nodes as a keyed dict {node_name: {...}}. The frontend
    calls .map() which requires an array, so we convert here rather than change
    the on-disk format (which the supervisor also reads).
    """
    raw = read_json_file(WORLD_DIR / "nodes.json", default={})
    if isinstance(raw, list):
        return raw
    result = []
    for name, info in raw.items():
        result.append({
            "id": name,
            "hostname": name,
            "health": _node_health(info),
            "status": info.get("status", "unknown"),
            "capabilities": info.get("roles", []),
            "connectivity": "tailscale",
            "incidents": 0,
            "last_seen": info.get("last_seen"),
            "disk_usage_pct": info.get("disk_usage_pct"),
            "mem_usage_pct": info.get("mem_usage_pct"),
            "cpu_usage_pct": info.get("cpu_usage_pct"),
            "disk_pressure": info.get("disk_pressure"),
        })
    return result


def current_services():
    """Return services as a list of dicts shaped for the UI.

    Observer stores services as {"node/service": {...}}. Converted to a list
    with the fields the services and topology views expect.
    """
    raw = read_json_file(WORLD_DIR / "services.json", default={})
    if isinstance(raw, list):
        return raw
    result = []
    for key, info in raw.items():
        svc_status = info.get("status", "unknown")
        result.append({
            "id": key,
            "name": info.get("service", key),
            "node": info.get("node", ""),
            "health": ("nominal" if svc_status == "healthy"
                       else ("error" if svc_status == "unhealthy"
                             else svc_status)),
            "desired_state": "running",
            "actual_state": svc_status,
            "deployment_state": "deployed",
            "dependencies": [],
            "recommendations": [],
            "last_check": info.get("last_check"),
            "incident_id": info.get("incident_id"),
        })
    return result


def current_deployments():
    """Return deployments as a list sorted newest-first."""
    raw = read_json_file(WORLD_DIR / "deployments.json", default={})
    if isinstance(raw, list):
        return raw
    result = []
    for dep_id, info in raw.items():
        result.append({
            "id": dep_id,
            "service": info.get("service", ""),
            "node": info.get("node", ""),
            "status": info.get("status", "unknown"),
            "stage": info.get("status", "unknown"),
            "diagnostics": info.get("last_error", ""),
            "resumable": info.get("status") == "failed",
            "started_at": info.get("started_at"),
            "finished_at": info.get("finished_at"),
        })
    return sorted(result, key=lambda x: x.get("started_at") or 0, reverse=True)


def current_incidents():
    """Return active incidents as a list sorted most-recent-first.

    Only incidents with status='active' are returned; resolved and cancelled
    records are excluded so the dashboard reflects the current operational state.
    """
    raw = read_json_file(WORLD_DIR / "incidents.json", default={})
    if isinstance(raw, list):
        return [i for i in raw if i.get("status") == "active"]
    result = []
    for inc in raw.values():
        if inc.get("status") != "active":
            continue
        # Synthesise a human-readable message if not stored (observer doesn't set one).
        if "message" not in inc:
            inc = dict(inc)
            inc["message"] = (
                f"{inc.get('service', '?')} on {inc.get('node', '?')} "
                f"is {inc.get('trigger_type', 'unhealthy')}"
            )
        result.append(inc)
    return sorted(result, key=lambda x: x.get("last_occurrence") or 0, reverse=True)


def current_recommendations():
    return read_json_file(WORLD_DIR / "recommendations.json")


def current_summary():
    path = WORLD_DIR / "runtime-summary.json"
    summary = read_json_file(path, default={})
    if summary:
        last_update_val = summary.get("last_update")
        if last_update_val:
            try:
                if isinstance(last_update_val, str):
                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
                else:
                    last_update = float(last_update_val)
            except Exception:
                last_update = os.path.getmtime(path)
        else:
            last_update = os.path.getmtime(path)
        summary["last_update"] = last_update
        summary["stale"] = (time.time() - last_update) > 60
    return summary


def _event_file_ts(p: Path) -> int:
|
|
||||||
"""Extract epoch timestamp from event filename: evt-<node>-<ts>-<type>-<svc>.json"""
|
|
||||||
m = _EVENT_TS_RE.search(p.stem)
|
|
||||||
return int(m.group(1)) if m else 0
|
|
||||||
|
|
||||||
|
|
||||||
def current_events():
    """Return the EVENTS_MAX_COUNT most-recent events, sorted newest-first.

    Event files are named evt-<node>-<epoch>-<type>-<svc>.json. The directory
    can contain hundreds of thousands of files (one file per event, written by
    node-agent). Loading every file on each request causes catastrophic RSS
    growth: 242 k files ≈ 420 MB of Python objects + 100 MB JSON serialisation.

    Fix: use heapq.nlargest to stream through file paths
    (O(N_files · log EVENTS_MAX_COUNT) time, O(EVENTS_MAX_COUNT) memory),
    extracting the epoch from the filename without opening any file. Only the
    winning EVENTS_MAX_COUNT files are then read.
    """
    if not EVENTS_DIR.exists():
        return []

    cutoff = time.time() - EVENTS_MAX_AGE_HOURS * 3600

    # Stream all paths through a bounded heap of EVENTS_MAX_COUNT entries;
    # the full list of paths is never materialised.
    candidates = heapq.nlargest(
        EVENTS_MAX_COUNT,
        EVENTS_DIR.glob("**/*.json"),
        key=_event_file_ts,
    )

    events = []
    for f in candidates:
        data = read_json_file(f)
        if data and (data.get("timestamp") or 0) > cutoff:
            data["_source"] = f.name
            events.append(data)

    return sorted(events, key=lambda x: x.get("timestamp") or 0, reverse=True)


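The top-K selection described in the `current_events` docstring can be tried in isolation. The sketch below is a minimal illustration, not the module's actual code: `_TS_RE` is a hypothetical stand-in for `_EVENT_TS_RE` (whose real pattern is not shown here), and the file names are invented.

```python
import heapq
import re
from pathlib import Path

# Hypothetical pattern (assumption): the first hyphen-delimited run of 9+ digits
# in the file stem is treated as the epoch timestamp.
_TS_RE = re.compile(r"-(\d{9,})-")


def file_ts(p: Path) -> int:
    """Return the epoch embedded in the file name, or 0 if none is found."""
    m = _TS_RE.search(p.stem)
    return int(m.group(1)) if m else 0


def newest(paths, k):
    """Keep only the k paths with the largest embedded timestamps.

    heapq.nlargest consumes the iterable lazily and holds at most k items,
    so memory stays bounded no matter how many paths are streamed through.
    """
    return heapq.nlargest(k, paths, key=file_ts)


# Invented example names; no files are opened or created.
names = [
    "evt-piha-1700000300-service_healthy-redis.json",
    "evt-vps-1700000100-disk_pressure-host.json",
    "evt-chelsty-infra-1700000200-mqtt_unreachable-frigate.json",
    "notes.json",  # no timestamp, sorts last with key 0
]
top_two = [p.name for p in newest((Path(n) for n in names), 2)]
print(top_two)
```

Because the key is derived from the name alone, a file only has to be opened once it has already made the final cut.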
def current_actions():
    """Return all actions grouped by their status directory."""
    actions = {}
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    for status in statuses:
        actions[status] = []
        status_dir = ACTIONS_DIR / status
        if status_dir.exists():
            for f in status_dir.glob("*.json"):
                data = read_json_file(f)
                if data:
                    # Inject some metadata for the UI
                    data["id"] = data.get("action_id") or f.stem
                    data["status"] = status
                    actions[status].append(data)
    return actions


def mutate_action(action_id, target_status):
    """Move an action file to the target status directory.

    Returns a (success, message) tuple.
    """
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"

    # Find where the action is
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break

    if not source_path:
        return False, f"Action {action_id} not found"

    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"

    try:
        data = json.loads(source_path.read_text())
        data["status"] = target_status
        data["updated_at"] = time.time()

        # Keep history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": time.time()
        })
        data["transition_history"] = history

        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)


def send_json(status, payload, handler):
    """Serialise payload as JSON and write it as the full HTTP response."""
    body = (json.dumps(payload) + "\n").encode("utf-8")
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/config":
            send_json(200, get_config(), self)
            return

        if self.path == "/nodes":
            send_json(200, current_nodes(), self)
            return

        if self.path == "/services":
            send_json(200, current_services(), self)
            return

        if self.path == "/deployments":
            send_json(200, current_deployments(), self)
            return

        if self.path == "/incidents":
            send_json(200, current_incidents(), self)
            return

        if self.path == "/recommendations":
            send_json(200, current_recommendations(), self)
            return

        if self.path == "/summary":
            send_json(200, current_summary(), self)
            return

        if self.path == "/events":
            send_json(200, current_events(), self)
            return

        if self.path == "/actions":
            send_json(200, current_actions(), self)
            return

        if self.path in ("/", "/index.html"):
            body = (STATIC_DIR / "index.html").read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return

        self.send_error(404)

    def do_POST(self):
        if self.path not in (
            "/config",
            "/action/mutate",
            "/mode",
        ):
            self.send_error(404)
            return

        length = int(self.headers.get("Content-Length", "0"))
        raw_body = self.rfile.read(length).decode("utf-8")
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError:
            self.send_error(400, "Invalid JSON")
            return

        if self.path == "/config":
            config = get_config()
            config.update(payload)
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/mode":
            mode = payload.get("mode")
            if not mode:
                self.send_error(400, "mode is required")
                return
            config = get_config()
            config["operator_mode"] = mode
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/action/mutate":
            action_id = payload.get("id")
            target = payload.get("status")
            if not action_id or not target:
                self.send_error(400, "id and status are required")
                return
            success, msg = mutate_action(action_id, target)
            if success:
                send_json(200, {"status": "ok"}, self)
            else:
                self.send_error(500, msg)
            return

    def log_message(self, format, *args):
        return


class OperatorHTTPServer(ThreadingHTTPServer):
    # Use daemon threads so finished request threads do not accumulate in the
    # internal _threads list. ThreadingMixIn only tracks non-daemon threads
    # (for joining at server_close); with daemon_threads=True that list stays
    # empty, preventing unbounded growth of dead Thread objects over time.
    daemon_threads = True


if __name__ == "__main__":
    # Ensure directories exist
    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
        d.mkdir(parents=True, exist_ok=True)
    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)

    port = int(os.getenv("PORT", "8080"))
    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
    server = OperatorHTTPServer(("0.0.0.0", port), Handler)
    server.serve_forever()

@ -1,771 +0,0 @@
import os
import json
import time
import logging
import yaml
from pathlib import Path


def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

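The write-then-rename pattern in `_atomic_write_json` can be exercised on its own. The sketch below copies the same steps into a standalone function (named `atomic_write_json` here only for illustration) and runs it against a temporary directory; the file name is made up.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write to a sibling .tmp file, fsync it, then rename over the target."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in a single rename, so a concurrent reader
    # sees either the old content or the new content, never a partial file.
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "redeploy-vps-outline.json"  # invented action file name
    atomic_write_json(target, {"status": "pending"})
    atomic_write_json(target, {"status": "approved"})
    final = json.loads(target.read_text())
    leftovers = sorted(p.name for p in Path(d).iterdir())

print(final, leftovers)
```

One consequence of `with_suffix(".tmp")` is that two targets differing only in their extension would share a temporary name; that does not arise when every target ends in `.json`.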
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
WORLD_DIR = Path(RUNTIME_PATH) / "world"
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))

# Node alias map: maps alternative node names (as they appear in events/world state)
# to canonical topology node names (as they appear in hosts/*/services.yaml and topology.yaml).
# Override at runtime via NODE_ALIAS_MAP env var as a JSON string, e.g.:
#   NODE_ALIAS_MAP='{"node-2": "chelsty", "node-1": "piha"}'
_NODE_ALIAS_ENV = os.getenv("NODE_ALIAS_MAP", "{}")
try:
    NODE_ALIAS_MAP = json.loads(_NODE_ALIAS_ENV)
except Exception:
    NODE_ALIAS_MAP = {}

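As a rough illustration of how the alias map is meant to be used, the sketch below parses a JSON string the way the `NODE_ALIAS_MAP` block does and then rewrites a world-state service key to its canonical node. The helper names `parse_alias_map` and `normalize_services` are hypothetical, and the `isinstance` guard is an extra assumption that the module itself does not make.

```python
import json


def parse_alias_map(raw: str) -> dict:
    """Parse the env-var value; fall back to an empty map on any problem."""
    try:
        parsed = json.loads(raw)
    except Exception:
        return {}
    # Assumption: anything that is not a JSON object is ignored as well.
    return parsed if isinstance(parsed, dict) else {}


def normalize_services(raw_services: dict, alias_map: dict) -> dict:
    """Re-key '<node>/<service>' entries under the canonical node name."""
    normalized = {}
    for svc_key, svc_info in raw_services.items():
        info = dict(svc_info)  # copy so the input is left untouched
        raw_node = info.get("node", "")
        node = alias_map.get(raw_node, raw_node)
        info["node"] = node
        name = info.get("service") or svc_key.split("/", 1)[-1]
        normalized[f"{node}/{name}"] = info
    return normalized


alias_map = parse_alias_map('{"node-2": "chelsty", "node-1": "piha"}')
raw = {"node-2/frigate": {"node": "node-2", "status": "healthy"}}
result = normalize_services(raw, alias_map)
print(result)
```

Unknown names pass through unchanged, so a missing or malformed map leaves every key as it was.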
# Event trigger types that should result in a lightweight container_restart
# rather than a full redeploy. The container is present but not running,
# or a dependency (MQTT) is unreachable; a restart is the right first step.
CONTAINER_RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}

# Nodes where automatic disk_cleanup actions must NOT be generated.
# On chelsty nodes disk fullness is overwhelmingly caused by Frigate recordings
# or the HA database; Docker cleanup will not help and the operator must
# decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}

# ---------------------------------------------------------------------------
# HA diagnostic event routing (ha-diag-agent events)
# ---------------------------------------------------------------------------

# ha_websocket_dead: HA WebSocket unresponsive → restart the homeassistant container.
# Separate from CONTAINER_RESTART_TRIGGERS because these events are routed directly
# from the events dir (not via the world-state drift loop) to avoid conflicts with
# the stability-agent's independent container health tracking on the same service key.
HA_CONTAINER_RESTART_EVENTS = {"ha_websocket_dead"}

# Alert-only events: operator notification, no automated action.
HA_ALERT_ONLY_EVENTS = {
    "ha_integration_failed",
    "ha_entity_unavailable_long",
    "ha_automation_failing",
    "ha_update_available",
    "ha_recorder_lag",
    "ha_system_health_degraded",
}

# Stable action-ID suffix for each alert-only type
_HA_ALERT_ID_SUFFIX = {
    "ha_integration_failed": "integration-failed",
    "ha_entity_unavailable_long": "entity-unavailable",
    "ha_automation_failing": "automation-failing",
    "ha_update_available": "update-available",
    "ha_recorder_lag": "recorder-lag",
    "ha_system_health_degraded": "system-health-degraded",
}

# 30-min cooldown after a container_restart completes; prevents restart loops
# when HA repeatedly fails to connect (e.g. bad config, slow startup).
HA_WEBSOCKET_RESTART_COOLDOWN = 1800

# 1-hour cooldown for alert-only events; avoids repeated Telegram noise for
# persistent conditions (e.g. an entity that stays unavailable for hours).
HA_ALERT_COOLDOWN = 3600

# Suppress ha_* events if homeassistant had a containers_not_running incident
# within this window; HA is in a planned restart/update and alerts would be noise.
HA_TRANSITION_WINDOW = 300  # 5 minutes

# When True, events that would generate container_restart are downgraded to alert_only
# with a "[SHADOW MODE]" note. Safe default for initial deployment; set
# HA_DIAG_SHADOW_MODE=false on the control-plane node when ready for live actions.
HA_DIAG_SHADOW_MODE = os.getenv("HA_DIAG_SHADOW_MODE", "true").lower() == "true"

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("supervisor")


class Supervisor:
    def __init__(self):
        self.desired_state = {"services": {}}
        self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
        # In-memory set of already-routed HA event IDs; prevents re-processing
        # on each reconcile cycle. Grows to at most ~hundreds of entries/day.
        self._ha_processed_event_ids: set = set()
        self._ensure_dirs()
        logger.info(
            "shadow_mode=%s — HA container_restart actions %s",
            HA_DIAG_SHADOW_MODE,
            "downgraded to alert_only" if HA_DIAG_SHADOW_MODE else "enabled",
        )

    def _ensure_dirs(self):
        ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
        (ACTIONS_DIR / "pending").mkdir(parents=True, exist_ok=True)

    # ------------------------------------------------------------------
    # Node name resolution
    # ------------------------------------------------------------------

    def _resolve_node(self, name):
        """Resolve an event/world-state node name to its canonical topology name."""
        return NODE_ALIAS_MAP.get(name, name)

    # ------------------------------------------------------------------
    # Container name lookup
    # ------------------------------------------------------------------

    def _get_container_name(self, service):
        """
        Determine the Docker container name for a service.
        Parses container_name from the service's docker-compose.yml.
        Falls back to the service name if not found.
        """
        compose_path = REPO_ROOT / "services" / service / "docker-compose.yml"
        if compose_path.exists():
            try:
                with open(compose_path, "r") as f:
                    compose = yaml.safe_load(f)
                for svc_block in compose.get("services", {}).values():
                    cname = svc_block.get("container_name")
                    if cname:
                        return cname
            except Exception as e:
                logger.warning(f"Could not parse docker-compose for {service}: {e}")
        # Convention: container name matches service name
        return service

    # ------------------------------------------------------------------
    # State loading
    # ------------------------------------------------------------------

    def _load_desired_state(self):
        services = {}
        hosts_dir = REPO_ROOT / "hosts"
        if not hosts_dir.exists():
            logger.warning(f"Hosts directory {hosts_dir} does not exist")
            return

        for host_dir in hosts_dir.iterdir():
            if host_dir.is_dir():
                svc_file = host_dir / "services.yaml"
                if svc_file.exists():
                    try:
                        with open(svc_file, "r") as f:
                            data = yaml.safe_load(f)
                        host_name = data.get("host")
                        for svc_name, svc_info in data.get("services", {}).items():
                            svc_info = svc_info or {}
                            # monitor: false means the service is documented as desired but
                            # intentionally excluded from supervisor action generation.
                            # Use this when a service is not yet bootstrapped on an
                            # offline/LTE node so the queue stays clean until it is.
                            if svc_info.get("monitor") is False:
                                logger.debug(
                                    f"Skipping {host_name}/{svc_name}: monitor=false"
                                )
                                continue
                            svc_key = f"{host_name}/{svc_name}"
                            services[svc_key] = {
                                "node": host_name,
                                "service": svc_name,
                                "desired": "running"
                            }
                    except Exception as e:
                        logger.error(f"Failed to load {svc_file}: {e}")
        self.desired_state["services"] = services

    def _load_actual_state(self) -> bool:
        """Load world state from disk. Returns False if any file is unreadable
        (empty / mid-write truncation), in which case actual_state is NOT updated
        so the caller can skip this reconcile cycle rather than treating missing
        data as a real drift signal."""
        files = {
            "services": WORLD_DIR / "services.json",
            "nodes": WORLD_DIR / "nodes.json",
            "incidents": WORLD_DIR / "incidents.json"
        }
        raw = {}
        for key, path in files.items():
            if path.exists():
                try:
                    with open(path, "r") as f:
                        raw[key] = json.load(f)
                except Exception as e:
                    logger.warning(
                        f"World state {path.name} unreadable (truncated write?): {e} "
                        f"— skipping reconcile cycle, keeping last known state"
                    )
                    return False
            else:
                raw[key] = {}

        # Normalize node names in services using alias map so that
        # event-sourced names (e.g. "node-2") resolve to canonical
        # topology names (e.g. "chelsty") before comparison with desired state.
        normalized_services = {}
        for svc_key, svc_info in raw.get("services", {}).items():
            svc_info = dict(svc_info)
            raw_node = svc_info.get("node", "")
            canonical_node = self._resolve_node(raw_node)
            if canonical_node != raw_node:
                logger.debug(f"Resolved node alias: {raw_node} → {canonical_node}")
            svc_info["node"] = canonical_node
            svc_name = svc_info.get("service") or svc_key.split("/", 1)[-1]
            svc_key = f"{canonical_node}/{svc_name}"
            normalized_services[svc_key] = svc_info

        # Normalize node names in incidents as well
        normalized_incidents = {}
        for inc_id, inc in raw.get("incidents", {}).items():
            inc = dict(inc)
            raw_node = inc.get("node", "")
            inc["node"] = self._resolve_node(raw_node)
            normalized_incidents[inc_id] = inc

        self.actual_state["services"] = normalized_services
        self.actual_state["nodes"] = raw.get("nodes", {})
        self.actual_state["incidents"] = normalized_incidents
        return True

    # ------------------------------------------------------------------
    # Incident helpers
    # ------------------------------------------------------------------

    def _get_incident_trigger(self, svc_key):
        """
        Return the trigger_type of the active incident for a service, or None.
        trigger_type is set by the observer when it creates an incident from
        a specific event type (e.g. 'containers_not_running', 'mqtt_unreachable').
        """
        svc_info = self.actual_state["services"].get(svc_key, {})
        incident_id = svc_info.get("incident_id")
        if not incident_id:
            return None
        incident = self.actual_state["incidents"].get(incident_id, {})
        if incident.get("status") == "active":
            return incident.get("trigger_type")
        return None

    # ------------------------------------------------------------------
    # Reconciliation loop
    # ------------------------------------------------------------------

    def reconcile(self):
        # Update heartbeat
        heartbeat_file = WORLD_DIR.parent / "state" / "supervisor.heartbeat"
        try:
            heartbeat_file.touch()
        except Exception as e:
            logger.error(f"Failed to touch heartbeat file: {e}")

        self._load_desired_state()
        if not self._load_actual_state():
            return  # world state unreadable this cycle; skip to avoid false drift

        drifts = []

        # 1. Check for missing or unhealthy services
        for svc_key, desired_info in self.desired_state["services"].items():
            actual_info = self.actual_state["services"].get(svc_key)

            if not actual_info:
                drifts.append({
                    "type": "missing_service",
                    "svc_key": svc_key,
                    "node": desired_info["node"],
                    "service": desired_info["service"],
                    "trigger_type": None,
                })
            elif actual_info.get("status") != "healthy":
                trigger_type = self._get_incident_trigger(svc_key)
                drifts.append({
                    "type": "unhealthy_service",
                    "svc_key": svc_key,
                    "node": desired_info["node"],
                    "service": desired_info["service"],
                    "status": actual_info.get("status"),
                    "trigger_type": trigger_type,
                })

        # 2. Generate service-level recommendations
        for drift in drifts:
            self._generate_recommendation(drift)

        # 3. Generate node-level recommendations (disk pressure)
        for node_name, node_info in self.actual_state["nodes"].items():
            if node_name in NO_DISK_CLEANUP_NODES:
                continue
            if node_info.get("disk_pressure") == "high":
                self._generate_disk_cleanup_recommendation(node_name)

        # 4. Cancel pending actions whose drift has been resolved.
        #    When a service becomes healthy again (because node-agent emits
        #    service_healthy and the observer updates services.json), any
        #    previously queued redeploy/container_restart action for that
        #    service is no longer needed. Move it to "cancelled/" so the
        #    operator can see it was auto-resolved rather than silently dropped.
        self._cancel_resolved_pending_actions()

        # 5. Route HA diagnostic events emitted by ha-diag-agent.
        #    Processed directly from the events directory (not via the
        #    world-state drift loop) to avoid conflicts with stability-agent's
        #    independent container health tracking for the homeassistant service.
        self._process_ha_events()

    # ------------------------------------------------------------------
    # Recommendation generation
    # ------------------------------------------------------------------

    def _generate_recommendation(self, drift):
        node = drift["node"]
        service = drift["service"]
        trigger_type = drift.get("trigger_type")

        # Choose action type first so we can build the stable, deterministic ID.
        # Stable IDs mean reconcile is truly idempotent: the same drift always
        # produces the same filename, so we never create duplicates even across
        # restarts of the supervisor.
        if trigger_type in CONTAINER_RESTART_TRIGGERS:
            action_id = f"container-restart-{node}-{service}"
        else:
            action_id = f"redeploy-{node}-{service}"

        # Skip if an action for this ID is already live in any active state
        # (pending → approved → running). This prevents re-creation after
        # a human approves an action that hasn't executed yet.
        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if trigger_type in CONTAINER_RESTART_TRIGGERS:
            # Lightweight remediation: the container exists but is not running
            # (containers_not_running) or its MQTT dependency is unreachable
            # (mqtt_unreachable). A docker restart is sufficient and low-risk.
            container_name = self._get_container_name(service)
            action = {
                "action_id": action_id,
                "timestamp": time.time(),
                "type": "container_restart",
                "node": node,
                "service": service,
                "container_name": container_name,
                "risk_level": "low",
                "confidence": 0.95,
                "description": (
                    f"Restart container '{container_name}' on {node} "
                    f"(service: {service}, reason: {trigger_type})"
                ),
                "status": "pending",
                "payload": {
                    "reason": trigger_type,
                    "svc_key": drift["svc_key"],
                },
            }
        else:
            # Full redeploy: container is running but service is broken,
            # or the cause is unknown / not a simple restart candidate.
            action = {
                "action_id": action_id,
                "timestamp": time.time(),
                "type": "redeploy",
                "node": node,
                "service": service,
                "risk_level": "guarded",
                "confidence": 0.9,
                "description": f"Redeploy {service} on {node} due to {drift['type']}",
                "status": "pending",
                "payload": {
                    "reason": drift["type"],
                    "svc_key": drift["svc_key"],
                },
            }

        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            _atomic_write_json(action_path, action)
            logger.info(
                f"Generated recommendation: {action_id} "
                f"(type={action['type']}, risk={action['risk_level']})"
            )
        except Exception as e:
            logger.error(f"Failed to save recommendation {action_id}: {e}")

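The idempotency argument behind the stable action IDs can be checked with a small standalone sketch. It is a simplified stand-in, not the supervisor's code: `recommend` is a hypothetical function that only writes a stub file, and the node and service names are invented.

```python
import json
import tempfile
from pathlib import Path

ACTIVE_STATES = ("pending", "approved", "running")


def recommend(actions_dir: Path, node: str, service: str, restart: bool) -> bool:
    """Queue an action unless one with the same stable ID is already live.

    Returns True when a new pending file was written, False when skipped.
    """
    prefix = "container-restart" if restart else "redeploy"
    action_id = f"{prefix}-{node}-{service}"  # same drift, same file name
    for state in ACTIVE_STATES:
        if (actions_dir / state / f"{action_id}.json").exists():
            return False
    pending = actions_dir / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    (pending / f"{action_id}.json").write_text(json.dumps({"action_id": action_id}))
    return True


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    first = recommend(root, "piha", "redis", restart=True)
    second = recommend(root, "piha", "redis", restart=True)  # duplicate drift

    # Simulate operator approval by moving the file to approved/.
    (root / "approved").mkdir()
    name = "container-restart-piha-redis.json"
    (root / "pending" / name).rename(root / "approved" / name)

    third = recommend(root, "piha", "redis", restart=True)
    queued = sorted(p.name for p in (root / "pending").iterdir())

print(first, second, third, queued)
```

Checking `approved` and `running` as well as `pending` is what keeps an approved but not yet executed action from being queued a second time.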
    def _generate_disk_cleanup_recommendation(self, node: str):
        """
        Generate a disk_cleanup action when node-agent reports critical disk
        pressure (>85 %) on a node that supports automated Docker cleanup.

        This is an OPERATOR-APPROVED action (risk=guarded): it runs
        `docker image prune -a -f` and `docker volume prune -f`, which are
        more aggressive than the safe auto-cleanup the node-agent runs itself.

        Nodes in NO_DISK_CLEANUP_NODES never reach this method (filtered in
        reconcile) because their disk fullness is caused by application data
        (Frigate, HA) that the operator must handle manually.
        """
        action_id = f"disk-cleanup-{node}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "disk_cleanup",
            "node": node,
            "service": "",
            "risk_level": "guarded",
            "confidence": 0.85,
            "description": (
                f"Aggressive disk cleanup on {node}: docker image prune -a "
                f"and docker volume prune (requires operator approval)"
            ),
            "status": "pending",
            "payload": {
                "reason": "disk_pressure",
                "commands": [
                    "docker image prune -a -f",
                    "docker volume prune -f",
                ],
            },
        }

        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            _atomic_write_json(action_path, action)
            logger.info(
                f"Generated disk cleanup recommendation: {action_id} "
                f"(node={node}, risk=guarded)"
            )
        except Exception as e:
            logger.error(f"Failed to save disk cleanup recommendation {action_id}: {e}")

    def _cancel_resolved_pending_actions(self):
        """
        Auto-cancel pending service actions (redeploy / container_restart) whose
        target service is now healthy in the actual state.

        This keeps the action queue clean: when node-agent starts reporting
        service_healthy for a container that previously had no world-state entry,
        the pending 'missing_service' redeploy action that was generated before
        the first health confirmation should be removed automatically rather than
        sitting in the queue until an operator manually rejects it.

        Only pending actions are considered; approved/running actions have already
        been committed to by the operator and must not be cancelled automatically.
        """
        cancelled_dir = ACTIONS_DIR / "cancelled"
        cancelled_dir.mkdir(parents=True, exist_ok=True)
        pending_dir = ACTIONS_DIR / "pending"
        if not pending_dir.exists():
            return

        for action_file in list(pending_dir.glob("*.json")):
            try:
                with open(action_file, "r") as f:
                    action = json.load(f)
            except Exception as e:
                logger.error(f"Failed to read action {action_file.name}: {e}")
                continue

            action_type = action.get("type")
            node = action.get("node")
            service = action.get("service")

            # Only auto-cancel service-level actions (not disk_cleanup)
            if action_type not in ("redeploy", "container_restart"):
                continue
            if not node or not service:
                continue

            svc_key = f"{node}/{service}"

            cancel_reason = None

            # Case 1: service is no longer in desired state (removed from services.yaml
            # or marked monitor:false). The action was generated under old config.
            if svc_key not in self.desired_state["services"]:
                cancel_reason = "service_removed_from_desired_state"

            # Case 2: drift resolved, i.e. the service is now healthy in actual state.
            elif self.actual_state["services"].get(svc_key, {}).get("status") == "healthy":
                cancel_reason = "drift_resolved_auto"

            if cancel_reason:
                dest = cancelled_dir / action_file.name
                try:
                    action["status"] = "cancelled"
                    action["cancelled_reason"] = cancel_reason
                    action["cancelled_at"] = time.time()
                    _atomic_write_json(dest, action)
                    action_file.unlink()
                    logger.info(
                        f"Auto-cancelled {action_file.name}: "
                        f"{svc_key} — {cancel_reason}"
                    )
                except Exception as e:
                    logger.error(f"Failed to cancel action {action_file.name}: {e}")

    # ------------------------------------------------------------------
    # HA diagnostic event routing
    # ------------------------------------------------------------------

    def _process_ha_events(self):
        """Scan the events directory for unprocessed ha_* events and route them."""
        if not EVENTS_DIR.exists():
            return
        for event_file in sorted(EVENTS_DIR.glob("**/*.json")):
            event_id = event_file.stem
            if event_id in self._ha_processed_event_ids:
                continue
            self._ha_processed_event_ids.add(event_id)
            try:
                with open(event_file) as f:
                    event = json.load(f)
            except Exception as e:
                logger.debug(f"Could not read event {event_file}: {e}")
                continue
            if not event.get("type", "").startswith("ha_"):
                continue
            self._route_ha_event(event)

    def _route_ha_event(self, event: dict):
        event_type = event.get("type", "")
        node = event.get("node", "")
        if not node:
            return

        if event_type in HA_CONTAINER_RESTART_EVENTS:
            if self._is_ha_in_transition(node):
                logger.debug(
                    f"Suppressing {event_type} on {node}: homeassistant in transition"
                )
                return
            if HA_DIAG_SHADOW_MODE:
                logger.info(
                    "shadow_mode: suppressed container_restart for %s", event_type
                )
                self._generate_ha_shadow_alert(node, event)
            else:
                self._generate_ha_container_restart(node, event)

        elif event_type == "ha_websocket_recovered":
            self._cancel_ha_container_restart(node)

        elif event_type in HA_ALERT_ONLY_EVENTS:
            if self._is_ha_in_transition(node):
                logger.debug(
                    f"Suppressing {event_type} on {node}: homeassistant in transition"
                )
                return
            self._generate_ha_alert_only(node, event)

    def _is_ha_in_transition(self, node: str) -> bool:
        """Return True if homeassistant container had a recent containers_not_running incident.

        Suppresses ha_* alerts during planned HA restarts/updates to avoid
        flooding the operator with secondary diagnostic alerts.
        """
        svc_key = f"{node}/homeassistant"
        svc_info = self.actual_state["services"].get(svc_key, {})
        incident_id = svc_info.get("incident_id")
        if not incident_id:
            return False
        incident = self.actual_state["incidents"].get(incident_id, {})
        return (
            incident.get("status") == "active"
            and incident.get("trigger_type") == "containers_not_running"
            and time.time() - (incident.get("last_occurrence") or 0) < HA_TRANSITION_WINDOW
        )

    def _ha_action_recently_completed(self, action_id: str, cooldown: int) -> bool:
        """Return True if action completed/rejected/cancelled within the cooldown window."""
        for state in ("completed", "rejected", "cancelled"):
            path = ACTIONS_DIR / state / f"{action_id}.json"
            if path.exists():
                try:
                    with open(path) as f:
                        data = json.load(f)
                    finished = (
                        data.get("finished_at")
                        or data.get("cancelled_at")
                        or data.get("updated_at")
                        or 0
                    )
                    if time.time() - finished < cooldown:
                        return True
                except Exception:
                    pass
        return False

    def _generate_ha_container_restart(self, node: str, event: dict):
        service = "homeassistant"
        action_id = f"container-restart-{node}-{service}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = "ha_websocket_dead"
        payload["svc_key"] = f"{node}/{service}"

        container_name = self._get_container_name(service)
        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "container_restart",
            "node": node,
            "service": service,
            "container_name": container_name,
            "risk_level": "low",
            "confidence": 0.9,
            "description": (
                f"Restart '{container_name}' on {node}: HA WebSocket unresponsive"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _generate_ha_shadow_alert(self, node: str, event: dict):
        """Shadow-mode downgrade: emit alert_only instead of container_restart.

        Uses the same action_id and cooldown as the real restart so that
        cooldown semantics are identical regardless of shadow mode state.
        """
        service = "homeassistant"
        action_id = f"container-restart-{node}-{service}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = "ha_websocket_dead"
        payload["svc_key"] = f"{node}/{service}"
        payload["shadow_mode"] = True

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "alert_only",
            "node": node,
            "service": service,
            "risk_level": "info",
            "confidence": 0.9,
            "description": (
                f"[SHADOW MODE] would have triggered container_restart "
                f"for {service} on {node}: HA WebSocket unresponsive"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _generate_ha_alert_only(self, node: str, event: dict):
        event_type = event.get("type", "")
        suffix = _HA_ALERT_ID_SUFFIX.get(event_type, event_type.replace("_", "-"))
        action_id = f"alert-ha-{suffix}-{node}"

        for state in ("pending", "approved", "running"):
            if (ACTIONS_DIR / state / f"{action_id}.json").exists():
                logger.debug(f"Skipping {action_id}: already in state '{state}'")
                return

        if self._ha_action_recently_completed(action_id, HA_ALERT_COOLDOWN):
            logger.debug(
                f"Skipping {action_id}: within {HA_ALERT_COOLDOWN}s cooldown"
            )
            return

        payload = dict(event.get("payload", {}))
        payload["reason"] = event_type

        action = {
            "action_id": action_id,
            "timestamp": time.time(),
            "type": "alert_only",
            "node": node,
            "service": event.get("service", "homeassistant"),
            "risk_level": "info",
            "confidence": 1.0,
            "description": event.get(
                "message", f"HA diagnostic alert: {event_type} on {node}"
            ),
            "status": "pending",
            "payload": payload,
        }
        self._write_pending_action(action)

    def _cancel_ha_container_restart(self, node: str):
        """Move a pending ha_websocket_dead container_restart to cancelled on recovery."""
        action_id = f"container-restart-{node}-homeassistant"
        pending_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        if not pending_path.exists():
            return
        cancelled_dir = ACTIONS_DIR / "cancelled"
        cancelled_dir.mkdir(parents=True, exist_ok=True)
        dest = cancelled_dir / f"{action_id}.json"
        try:
            with open(pending_path) as f:
                action = json.load(f)
            action["status"] = "cancelled"
            action["cancelled_reason"] = "ha_websocket_recovered"
            action["cancelled_at"] = time.time()
            _atomic_write_json(dest, action)
            pending_path.unlink()
            logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
        except Exception as e:
            logger.error(f"Failed to cancel {action_id}: {e}")

    def _write_pending_action(self, action: dict):
        action_id = action["action_id"]
        action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
        try:
            _atomic_write_json(action_path, action)
            logger.info(
                f"Generated HA action: {action_id} "
                f"(type={action['type']}, risk={action['risk_level']})"
            )
        except Exception as e:
            logger.error(f"Failed to save action {action_id}: {e}")

    def loop(self, interval=30):
        logger.info("Starting supervisor loop")
        while True:
            self.reconcile()
            time.sleep(interval)


if __name__ == "__main__":
    supervisor = Supervisor()
    supervisor.loop()
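The three HA action generators above share one guard: skip if the action is already in flight, then skip if it reached a terminal state within a cooldown window. The cooldown half can be sketched on its own as below. This is a minimal standalone sketch, not the supervisor's code: `recently_finished` and its `actions_dir` parameter are hypothetical names standing in for `_ha_action_recently_completed` and the module-level `ACTIONS_DIR`, and it catches narrower exceptions than the original's blanket `except Exception`.

```python
import json
import time
from pathlib import Path


def recently_finished(actions_dir: Path, action_id: str, cooldown: float) -> bool:
    """Return True if the action reached a terminal state less than `cooldown` seconds ago."""
    for state in ("completed", "rejected", "cancelled"):
        path = actions_dir / state / f"{action_id}.json"
        if not path.exists():
            continue
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            # Unreadable or malformed record: treat it as "no recent completion".
            continue
        # First timestamp that is present and non-zero wins; 0 means "very old".
        finished = (
            data.get("finished_at")
            or data.get("cancelled_at")
            or data.get("updated_at")
            or 0
        )
        if time.time() - finished < cooldown:
            return True
    return False
```

Because the shadow-mode alert reuses the restart's `action_id`, a record left by either path would start the same cooldown in a check like this one, which is the property the `_generate_ha_shadow_alert` docstring describes.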
@ -1,333 +0,0 @@
"""Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
from __future__ import annotations

import json
import sys
import time
from pathlib import Path

import pytest

# Observer lives outside the control-plane package; add scripts/ to path.
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
from observer.observer import Observer, _parse_ts, _atomic_write_json


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _make_observer(tmp_path: Path) -> Observer:
    """Return an Observer with all runtime paths redirected to tmp_path."""
    import observer.observer as obs_mod

    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"

    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)

    # Minimal topology so inventory isn't empty (avoids prune-guard early-return)
    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
    )

    original_world = obs_mod.WORLD_DIR
    original_state = obs_mod.STATE_DIR
    original_events = obs_mod.EVENTS_DIR
    original_logs = obs_mod.LOGS_DIR
    original_inventory = obs_mod.INVENTORY_TOPOLOGY
    original_repo = obs_mod.REPO_ROOT

    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo

    obs = Observer()

    # Restore module-level constants (monkeypatching at module level is sufficient
    # for the Observer instance which captures paths at construction time via globals)
    obs_mod.WORLD_DIR = original_world
    obs_mod.STATE_DIR = original_state
    obs_mod.EVENTS_DIR = original_events
    obs_mod.LOGS_DIR = original_logs
    obs_mod.INVENTORY_TOPOLOGY = original_inventory
    obs_mod.REPO_ROOT = original_repo

    return obs


def _make_observer_simple(tmp_path: Path):
    """Return an Observer instance and patch its world_state in-place."""
    import observer.observer as obs_mod

    world = tmp_path / "world"
    state = tmp_path / "state"
    events = tmp_path / "events"
    logs = tmp_path / "logs"
    repo = tmp_path / "repo"

    for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)

    (repo / "inventory" / "topology.yaml").write_text(
        "nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
    )

    # Patch before construction
    obs_mod.WORLD_DIR = world
    obs_mod.STATE_DIR = state
    obs_mod.EVENTS_DIR = events
    obs_mod.LOGS_DIR = logs
    obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
    obs_mod.REPO_ROOT = repo

    obs = Observer()
    return obs


# ---------------------------------------------------------------------------
# 1. _parse_ts — timestamp normalisation
# ---------------------------------------------------------------------------

def test_parse_ts_int():
    ts = int(time.time()) - 3600
    assert abs(_parse_ts(ts) - ts) < 1


def test_parse_ts_float():
    ts = time.time() - 100.5
    assert abs(_parse_ts(ts) - ts) < 0.01


def test_parse_ts_iso_string():
    # ISO format as emitted by events.py / stability-agent
    from datetime import datetime, timezone
    iso = "2026-06-01T00:03:22Z"
    expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
    result = _parse_ts(iso)
    assert result > 0
    assert isinstance(result, float)
    assert abs(result - expected) < 1


def test_parse_ts_none_returns_zero():
    assert _parse_ts(None) == 0.0


def test_parse_ts_garbage_returns_zero():
    assert _parse_ts("not-a-date") == 0.0


def test_parse_ts_zero_int():
    assert _parse_ts(0) == 0.0


# ---------------------------------------------------------------------------
# 2. Lifecycle: service_healthy event resolves linked incident
# ---------------------------------------------------------------------------

def test_service_healthy_resolves_active_incident(tmp_path):
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-111-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "node": "vps", "service": "outline",
        "status": "active", "trigger_type": "service_unhealthy",
        "started_at": int(time.time()) - 600,
        "last_occurrence": int(time.time()) - 600,
        "occurrence_count": 1, "events": [],
    }

    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })

    assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
    """service_healthy for service A must not touch incident for service B."""
    obs = _make_observer_simple(tmp_path)
    inc_b = "inc-222-vps-supervisor"
    obs.world_state["services"]["vps/supervisor"] = {
        "node": "vps", "service": "supervisor",
        "status": "unhealthy", "last_check": None,
        "incident_id": inc_b,
    }
    obs.world_state["incidents"][inc_b] = {
        "id": inc_b, "status": "active",
        "last_occurrence": int(time.time()) - 300,
    }

    obs.process_event({
        "type": "service_healthy",
        "node": "vps",
        "service": "outline",  # different service
        "severity": "info",
        "timestamp": int(time.time()),
        "payload": {},
    })

    assert obs.world_state["incidents"][inc_b]["status"] == "active"


# ---------------------------------------------------------------------------
# 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
# ---------------------------------------------------------------------------

def test_prune_resolves_healthy_linked_incident(tmp_path):
    """If a service is healthy but still points at an active incident, resolve it."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-333-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy",  # <-- healthy but incident_id still set
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "started_at": int(time.time()) - 7200,
        "last_occurrence": int(time.time()) - 7200,
    }

    obs._prune_stale_world()

    assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
    """Healthy-linked incident with ISO-string last_occurrence must still resolve."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-444-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "healthy", "last_check": None, "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",  # ISO string from events.py
    }

    obs._prune_stale_world()  # must not raise TypeError

    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


# ---------------------------------------------------------------------------
# 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
# ---------------------------------------------------------------------------

def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
    """Orphaned active incident older than 5 min must be auto-resolved."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-555-vps-supervisor"
    # No service entry links to this incident
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
        "last_occurrence": int(time.time()) - 400,  # 6.7 min ago
    }

    obs._prune_stale_world()

    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
    """Orphaned incident younger than 5 min must stay active (guard against race)."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-666-vps-supervisor"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 60,  # 1 min ago — within guard
    }

    obs._prune_stale_world()

    assert obs.world_state["incidents"][inc_id]["status"] == "active"


def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
    """Orphaned incident with ISO-string last_occurrence must resolve correctly."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-777-vps-outline"
    # ISO timestamp well in the past (2026-06-01)
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": "2026-06-01T00:03:22Z",
    }

    obs._prune_stale_world()  # must not raise TypeError

    assert obs.world_state["incidents"][inc_id]["status"] == "resolved"


def test_prune_does_not_touch_linked_incident(tmp_path):
    """An active incident still linked from a non-healthy service must stay active."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-888-vps-outline"
    obs.world_state["services"]["vps/outline"] = {
        "node": "vps", "service": "outline",
        "status": "unhealthy",  # <-- still unhealthy
        "last_check": None,
        "incident_id": inc_id,
    }
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "last_occurrence": int(time.time()) - 3600,
    }

    obs._prune_stale_world()

    assert obs.world_state["incidents"][inc_id]["status"] == "active"


# ---------------------------------------------------------------------------
# 5. 7-day stale incident prune with ISO resolved_at
# ---------------------------------------------------------------------------

def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
    """Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-old-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": "2026-05-01T00:00:00Z",  # >7 days before 2026-06-03
    }

    obs._prune_stale_world()

    assert inc_id not in obs.world_state["incidents"]


def test_prune_keeps_recently_resolved_incident(tmp_path):
    """Resolved incidents within 7 days must be kept."""
    obs = _make_observer_simple(tmp_path)
    inc_id = "inc-recent-resolved"
    obs.world_state["incidents"][inc_id] = {
        "id": inc_id, "status": "resolved",
        "resolved_at": time.time() - 86400,  # 1 day ago
    }

    obs._prune_stale_world()

    assert inc_id in obs.world_state["incidents"]
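The `_parse_ts` tests above pin down a small contract: epoch numbers pass through as floats, ISO-8601 strings with a trailing `Z` become epoch seconds, and `None` or unparseable input yields `0.0`. The real `_parse_ts` in `observer.observer` is not shown in this diff, so the following is only a hypothetical sketch of one function that would satisfy those tests; the name `parse_ts` and every implementation detail are assumptions.

```python
from datetime import datetime


def parse_ts(value) -> float:
    """Normalise an epoch number or ISO-8601 string to epoch seconds (0.0 if unusable)."""
    if value is None or isinstance(value, bool):
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z",
            # so rewrite it as an explicit UTC offset first.
            return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
        except ValueError:
            return 0.0
    return 0.0
```

Normalising at the comparison site is what lets arithmetic such as `time.time() - last_occurrence` work when `last_occurrence` arrives as an ISO string, which appears to be the `TypeError` the "must not raise TypeError" comments guard against. Note that in this sketch an ISO string without an offset would be interpreted as local time.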
@ -1,199 +0,0 @@
"""Tests for atomic writes and resilient world-state loading in the supervisor."""
from __future__ import annotations

import json
import sys
import time
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
import supervisor as supervisor_module
from supervisor import Supervisor, _atomic_write_json


# ---------------------------------------------------------------------------
# Helpers (reused from test_supervisor_ha)
# ---------------------------------------------------------------------------

def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
    actions = tmp_path / "actions"
    events = tmp_path / "events"
    world = tmp_path / "world"
    repo = tmp_path / "repo"

    for d in (actions, events, world, repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)

    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)

    sup = Supervisor()
    sup.desired_state = {"services": {}}
    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
    return sup


# ---------------------------------------------------------------------------
# 1. atomic_write_json correctness
# ---------------------------------------------------------------------------

def test_atomic_write_json_produces_valid_json(tmp_path):
    path = tmp_path / "out.json"
    data = {"services": {"vps/outline": {"status": "healthy"}}, "count": 42}
    _atomic_write_json(path, data)

    assert path.exists(), "output file must exist after atomic write"
    loaded = json.loads(path.read_text())
    assert loaded == data


def test_atomic_write_json_no_tmp_left_behind(tmp_path):
    path = tmp_path / "world.json"
    _atomic_write_json(path, {"ok": True})

    tmp = path.with_suffix(".tmp")
    assert not tmp.exists(), ".tmp must be cleaned up by os.replace"


def test_atomic_write_json_overwrites_existing(tmp_path):
    path = tmp_path / "state.json"
    path.write_text('{"old": true}')
    _atomic_write_json(path, {"new": True})
    assert json.loads(path.read_text()) == {"new": True}


def test_atomic_write_json_nested_structure(tmp_path):
    path = tmp_path / "complex.json"
    data = {
        "nodes": {"vps": {"status": "online", "disk_usage_pct": 42}},
        "incidents": {},
        "list": [1, 2, 3],
    }
    _atomic_write_json(path, data)
    assert json.loads(path.read_text()) == data


# ---------------------------------------------------------------------------
# 2. Resilient loader: empty / truncated file → skip cycle, no drift
# ---------------------------------------------------------------------------

def _populate_desired(sup: Supervisor, svc_key: str = "vps/outline"):
    node, service = svc_key.split("/", 1)
    sup.desired_state["services"][svc_key] = {
        "node": node,
        "service": service,
        "desired": "running",
    }


def test_empty_services_json_skips_reconcile(tmp_path, monkeypatch):
    """Empty services.json (truncated write) must not generate any redeploy action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    # Write empty services.json — simulates a mid-write truncation
    (tmp_path / "world" / "services.json").write_text("")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions should be generated on empty state file, got: {[p.name for p in pending]}"


def test_truncated_services_json_skips_reconcile(tmp_path, monkeypatch):
    """Partially-written (truncated mid-write) JSON must not generate any action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    (tmp_path / "world" / "services.json").write_text('{"vps/outline": {"status": "hea')
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions expected on truncated state, got: {[p.name for p in pending]}"


def test_empty_incidents_json_skips_reconcile(tmp_path, monkeypatch):
    """Empty incidents.json (any world-state file failing) skips full cycle."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    (tmp_path / "world" / "services.json").write_text("{}")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions expected when any state file is unreadable, got: {[p.name for p in pending]}"


def test_load_actual_state_returns_false_on_empty_file(tmp_path, monkeypatch):
    """_load_actual_state must return False (not raise) when a file is empty."""
    sup = _setup_supervisor(tmp_path, monkeypatch)

    (tmp_path / "world" / "services.json").write_text("")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    result = sup._load_actual_state()
    assert result is False


def test_load_actual_state_returns_true_on_valid_files(tmp_path, monkeypatch):
    """_load_actual_state returns True and populates actual_state on valid files."""
    sup = _setup_supervisor(tmp_path, monkeypatch)

    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text('{"vps": {"status": "online"}}')
    (tmp_path / "world" / "incidents.json").write_text("{}")

    result = sup._load_actual_state()
    assert result is True
    assert "vps/outline" in sup.actual_state["services"]


def test_parse_failure_preserves_last_known_good_state(tmp_path, monkeypatch):
    """When a file becomes unreadable, actual_state retains the previous good values."""
    sup = _setup_supervisor(tmp_path, monkeypatch)

    # First successful load
    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    assert sup._load_actual_state() is True
    assert "vps/outline" in sup.actual_state["services"]

    # File becomes empty (race condition)
    (tmp_path / "world" / "services.json").write_text("")
    assert sup._load_actual_state() is False

    # State must be unchanged from the previous good load
    assert "vps/outline" in sup.actual_state["services"], \
        "Last-known-good state must be preserved on parse failure"


def test_healthy_service_does_not_generate_action(tmp_path, monkeypatch):
    """A desired service that appears healthy in world state generates no action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)

    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")

    sup.reconcile()

    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], "Healthy service must not generate any action"
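The atomic-write tests above imply a write-to-temp-then-rename pattern: the assertion message mentions `os.replace`, and the temp file is expected at `path.with_suffix(".tmp")`. The body of `_atomic_write_json` is not shown in this diff, so the following is a minimal sketch of that pattern under those two assumptions; the name `atomic_write_json` and the `fsync` call are illustrative choices rather than the project's code.

```python
import json
import os
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write JSON so that readers see either the old file or the complete new one."""
    tmp = path.with_suffix(".tmp")  # e.g. services.json -> services.tmp
    with open(tmp, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    # os.replace() swaps the file in a single step when tmp and path are on the
    # same filesystem, so a concurrent reader cannot observe a half-written file.
    os.replace(tmp, path)
```

This guards the writer side. The resilient-loader tests cover the reader side: if a truncated or empty file is seen anyway, the reconcile cycle is skipped and the last known good state is kept.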
Some files were not shown because too many files have changed in this diff.