Compare commits
105 commits
v0.3-chels...master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 58ac6edd7d | | |
| | 19fd8799d9 | | |
| | 7f17b65278 | | |
| | e6a2443412 | | |
| | f9b145585f | | |
| | 3b620ef7e3 | | |
| | 745e52723c | | |
| | 1abe925f65 | | |
| | 1c69a5bc29 | | |
| | 02e7c28823 | | |
| | db592fbc28 | | |
| | 00fc36df3a | | |
| | f5dcefc752 | | |
| | 98437d46b2 | | |
| | 5e97b4e448 | | |
| | ffb0608b9a | | |
| | f381023206 | | |
| | cb4ae756ab | | |
| | cfe5e02372 | | |
| | 039f9f7247 | | |
| | 495741e7ac | | |
| | 43c5d45353 | | |
| | f64cec645e | | |
| | 1db9db7d03 | | |
| | 52607a7cdd | | |
| | b9ed118b8c | | |
| | bf1415e4c1 | | |
| | 31b48d162a | | |
| | 3499b2f280 | | |
| | f41ec5d0c5 | | |
| | 20f6761a67 | | |
| | 07bd498fd6 | | |
| | 90c8e77bf7 | | |
| | ab8895d28b | | |
| | bd7f955e4e | | |
| | 99200e6690 | | |
| | dcacac6965 | | |
| | e52b2e2259 | | |
| | 5ccdfa0ca6 | | |
| | ff6fda1f04 | | |
| | ca37fca5ce | | |
| | 1bbc511bb7 | | |
| | 603e10a364 | | |
| | 7277bdc27f | | |
| | b40b832159 | | |
| | 28e9534765 | | |
| | 46ae92b5c1 | | |
| | 410bfe7065 | | |
| | b3912fe0ce | | |
| | 61e07f4318 | | |
| | 51002d4502 | | |
| | fb7828b52b | | |
| | 2f1965733f | | |
| | 267742c7d7 | | |
| | 4e8968f9c7 | | |
| | f4a8db93e4 | | |
| | a5a3e223dc | | |
| | 2349de518b | | |
| | 65bac4ebfe | | |
| | 96bf32614f | | |
| | ae33cce889 | | |
| | c5c080b3e3 | | |
| | 01b7758fe6 | | |
| | 7742bda245 | | |
| | 98fe1f1846 | | |
| | beb8b5cbaa | | |
| | 898deda05f | | |
| | f34399a30d | | |
| | 9b39581b53 | | |
| | ae7446a04b | | |
| | f21be4f4d4 | | |
| | 8fb4d3d634 | | |
| | 35e57cc789 | | |
| | b02c8bb50e | | |
| | dc483ae31a | | |
| | 9d2f748557 | | |
| | 8a12b7ff17 | | |
| | f65698925e | | |
| | 9f20dcae05 | | |
| | b7251ac416 | | |
| | 807b097eb4 | | |
| | 5754994f8e | | |
| | c299a2cb85 | | |
| | b129f03837 | | |
| | b7faac00c5 | | |
| | 8f305ba3df | | |
| | c9ddfa9ac1 | | |
| | 3233cf07cd | | |
| | ac90acfac8 | | |
| | 12a775c834 | | |
| | 41c05f42b5 | | |
| | e8d6d6d473 | | |
| | 8d0f2379ba | | |
| | 90b2a5d0e9 | | |
| | b726048d41 | | |
| | 533b8e846d | | |
| | f4e6871d76 | | |
| | 793559a4b5 | | |
| | 0cf1106b34 | | |
| | 2029457f57 | | |
| | 8f5b905015 | | |
| | 72c5a53610 | | |
| | 431d777989 | | |
| | 95a976e930 | | |
| | 0eeb0ac600 | | |
43
.claude/skills/deploy/SKILL.md
Normal file
|
|
@ -0,0 +1,43 @@
|
||||||
|
---
|
||||||
|
name: deploy
|
||||||
|
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
|
||||||
|
---
|
||||||
|
|
||||||
|
Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
|
||||||
|
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
|
||||||
|
|
||||||
|
## Targets
|
||||||
|
|
||||||
|
| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
|
||||||
|
|
||||||
|
## Invocation
|
||||||
|
|
||||||
|
```bash
scripts/deploy/deploy.sh <target>             # full pipeline
scripts/deploy/deploy.sh <target> --dry-run   # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate   # emergency: bypass tests
```
|
||||||
|
|
||||||
|
## Exit Code Handling
|
||||||
|
|
||||||
|
| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
|
||||||
|
|
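The exit-code table above can be read as a simple lookup. The following is a minimal sketch, assuming a hypothetical helper named `classify_exit`; the action labels are illustrative names invented here, and the fallback for unlisted codes is an assumption, not something the skill file defines.

```python
# Hypothetical lookup mirroring the deploy.sh exit-code table.
# The action labels (second tuple item) are illustrative, not part of the repo.
EXIT_ACTIONS = {
    0: ("Success", "report"),
    1: ("Preflight failed", "fix_upstream_never_bypass"),
    2: ("Gate failed", "show_failure_do_not_deploy"),
    3: ("Execute failed", "show_output_ask_user"),
    4: ("Verify failed", "show_docker_ps_discuss_rollback"),
    5: ("Sudo handoff", "print_stderr_verbatim_and_stop"),
}


def classify_exit(code: int) -> tuple[str, str]:
    """Return (meaning, action label) for a deploy.sh exit code."""
    # Assumption: any code outside the table is treated as a hard stop.
    return EXIT_ACTIONS.get(code, ("Unknown exit code", "stop_and_report"))
```

A caller would run `scripts/deploy/deploy.sh <target>`, pass the process return code to `classify_exit`, and act on the label rather than branching on raw integers.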
||||||
|
## Rules
|
||||||
|
|
||||||
|
- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
|
||||||
|
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
|
||||||
|
- Canonical branch is `master` — preflight enforces this.
|
||||||
|
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.
|
||||||
65
.claude/skills/save-session/SKILL.md
Normal file
|
|
@ -0,0 +1,65 @@
|
||||||
|
---
|
||||||
|
name: save-session
|
||||||
|
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
|
||||||
|
---
|
||||||
|
|
||||||
|
**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
|
||||||
|
Never invoke proactively. Never invoke mid-task.
|
||||||
|
|
||||||
|
## 1. Determine Session Boundary
|
||||||
|
|
||||||
|
1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
|
||||||
|
2. Fallback if no previous entry exists: 24 hours ago.
|
||||||
|
|
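The boundary rule above (last `## Session HH:MM` heading, otherwise 24 hours ago) could be sketched as follows. The function name `session_boundary` and its parameters are hypothetical; the sketch assumes the caller has already read the latest entry file and knows the date encoded in its filename.

```python
import re
from datetime import datetime, timedelta

# Matches headings of the form "## Session HH:MM" on their own line.
SESSION_HEADING = re.compile(r"^## Session (\d{2}):(\d{2})\s*$", re.MULTILINE)


def session_boundary(entry_text, entry_date, now):
    """Return the start boundary for the session being saved.

    entry_text: contents of the latest file in docs/sessions/ ("" if none)
    entry_date: datetime for that file's date (taken from its filename)
    now:        current time, used only for the 24-hour fallback
    """
    matches = SESSION_HEADING.findall(entry_text or "")
    if not matches:
        # Fallback from the skill: no previous entry means 24 hours ago.
        return now - timedelta(hours=24)
    hour, minute = map(int, matches[-1])  # last heading wins
    return entry_date.replace(hour=hour, minute=minute, second=0, microsecond=0)
```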
||||||
|
## 2. Collect Facts (deterministic only — no invention)
|
||||||
|
|
||||||
|
Run exactly:
|
||||||
|
```bash
|
||||||
|
# All commits since boundary
|
||||||
|
git --no-pager log --oneline <boundary>..HEAD
|
||||||
|
|
||||||
|
# Changed file summary
|
||||||
|
git --no-pager diff --stat <boundary>..HEAD
|
||||||
|
```
|
||||||
|
|
||||||
|
From the visible conversation transcript: deploys run and their outcomes, test results seen.
|
||||||
|
|
||||||
|
## 3. Write the Session Entry
|
||||||
|
|
||||||
|
**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
|
||||||
|
Never overwrite existing content.
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
## Session HH:MM
|
||||||
|
|
||||||
|
### Commits
|
||||||
|
<output of git log --oneline>
|
||||||
|
|
||||||
|
### Files changed
|
||||||
|
<output of git diff --stat>
|
||||||
|
|
||||||
|
### Deploys
|
||||||
|
<list from transcript, or "None recorded">
|
||||||
|
|
||||||
|
### Narrative
|
||||||
|
> _user-provided summary_
|
||||||
|
```
|
||||||
|
|
||||||
|
The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
|
||||||
|
|
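As an illustration of the append-only rule and the mandatory placeholder, here is a minimal sketch. `append_session_entry` is a hypothetical helper, not a script in this repository; it assumes the git output has already been collected as strings.

```python
from pathlib import Path

PLACEHOLDER = "> _user-provided summary_"  # never filled in by the agent


def append_session_entry(path, hhmm, commits, files_changed, deploys=None):
    """Append one session entry to docs/sessions/YYYY-MM-DD.md without overwriting."""
    entry = (
        f"## Session {hhmm}\n\n"
        f"### Commits\n{commits}\n\n"
        f"### Files changed\n{files_changed}\n\n"
        f"### Deploys\n{deploys or 'None recorded'}\n\n"
        f"### Narrative\n{PLACEHOLDER}\n"
    )
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Separate from earlier entries with a blank line; mode "a" never truncates.
    prefix = "\n" if path.exists() and path.stat().st_size > 0 else ""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(prefix + entry)
```

Opening the file in append mode creates it when missing and leaves existing entries untouched, which matches both halves of the rule.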
||||||
|
## 4. What NOT to Touch
|
||||||
|
|
||||||
|
- `backlog.md` — only on explicit "update backlog" instruction
|
||||||
|
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
|
||||||
|
- Any other file not listed above
|
||||||
|
|
||||||
|
## 5. Commit
|
||||||
|
|
||||||
|
Stage and commit **only** the session file:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
git add docs/sessions/YYYY-MM-DD.md
|
||||||
|
git commit -m "docs: session YYYY-MM-DD HH:MM"
|
||||||
|
```
|
||||||
|
|
||||||
|
No other files. No `git add -A`.
|
||||||
81
.claude/skills/worktree-aware/SKILL.md
Normal file
|
|
@ -0,0 +1,81 @@
|
||||||
|
---
|
||||||
|
name: worktree-aware
|
||||||
|
description: >
|
||||||
|
Use when working in a git worktree checkout for a parallel agent task.
|
||||||
|
The presence of an .agent-task file in the current working directory indicates
|
||||||
|
a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
|
||||||
|
to the assigned task branch, NEVER push origin master, NEVER touch the main
|
||||||
|
checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
|
||||||
|
completion, report the branch name verbatim and stop — the human merges via
|
||||||
|
scripts/dev/agent.sh.
|
||||||
|
---
|
||||||
|
|
||||||
|
## When this applies
|
||||||
|
|
||||||
|
- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
|
||||||
|
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
|
||||||
|
In the main checkout these rules do not apply.
|
||||||
|
|
||||||
|
## Reading the marker
|
||||||
|
|
||||||
|
`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
|
||||||
|
|
||||||
|
```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```
|
||||||
|
|
||||||
|
Always read this file first before taking any action.
|
||||||
|
|
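Reading the marker and checking the branch could look like the sketch below. It is an assumption-laden illustration: `parse_agent_task` and `branch_matches` are hypothetical names, and the parser handles only the flat `key: value` layout shown above rather than full YAML.

```python
def parse_agent_task(text):
    """Parse the flat `key: value` lines of an .agent-task marker into a dict.

    Not a general YAML parser: nested structures and quoting are not handled.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def branch_matches(marker_text, current_branch):
    """True when the checked-out branch is the one assigned in .agent-task."""
    return parse_agent_task(marker_text).get("branch") == current_branch
```

The current branch would come from `git status` or `git rev-parse --abbrev-ref HEAD`; a `False` result corresponds to the "stop immediately and report the discrepancy" case in the rules.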
||||||
|
## Rules
|
||||||
|
|
||||||
|
1. **Commit only to your branch.**
|
||||||
|
Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
|
||||||
|
If it does not, stop immediately and report the discrepancy.
|
||||||
|
|
||||||
|
2. **Push only to your branch.**
|
||||||
|
The only permitted push is `git push origin task/<name>`.
|
||||||
|
NEVER `git push origin master` or any other branch.
|
||||||
|
|
||||||
|
3. **Do not touch the main checkout.**
|
||||||
|
`~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
|
||||||
|
Do not read from, write to, or execute commands inside it.
|
||||||
|
|
||||||
|
4. **Stay scoped.**
|
||||||
|
Only change files directly related to your assigned task.
|
||||||
|
If you notice other problems, report them in your final summary as separate follow-up proposals.
|
||||||
|
Do not fix them in this worktree.
|
||||||
|
|
||||||
|
5. **Never `git add -A`.**
|
||||||
|
Always stage specific files by name: `git add path/to/file`.
|
||||||
|
|
||||||
|
6. **Do not manage worktrees.**
|
||||||
|
Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
|
||||||
|
Worktree lifecycle is the human's responsibility.
|
||||||
|
|
||||||
|
7. **Final report before stopping.**
|
||||||
|
When the task is done, provide a structured report containing:
|
||||||
|
- Files changed (path and one-line summary of change)
|
||||||
|
- Tests run and results
|
||||||
|
- All commit hashes on the task branch
|
||||||
|
- **Branch name verbatim** (copy-paste ready)
|
||||||
|
- Follow-up items as bulleted proposals for separate tasks
|
||||||
|
|
||||||
|
## Definition of Done
|
||||||
|
|
||||||
|
- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
|
||||||
|
- Test suite passes
|
||||||
|
- Branch pushed: `git push origin task/<name>`
|
||||||
|
- Full report delivered in conversation
|
||||||
|
|
||||||
|
## What you do NOT do
|
||||||
|
|
||||||
|
- Merge branches
|
||||||
|
- Create or push tags
|
||||||
|
- Run deploys or healthchecks against production nodes
|
||||||
|
- Delete branches or worktrees
|
||||||
|
- Modify files in other worktrees
|
||||||
|
- Push to `origin master` under any circumstances
|
||||||
1
.gitignore
vendored
|
|
@ -15,6 +15,7 @@ __pycache__/
|
||||||
*$py.class
|
*$py.class
|
||||||
venv/
|
venv/
|
||||||
.venv/
|
.venv/
|
||||||
|
*.egg-info/
|
||||||
|
|
||||||
# Tools
|
# Tools
|
||||||
.aider*
|
.aider*
|
||||||
|
|
|
||||||
194
CLAUDE.md
Normal file
|
|
@ -0,0 +1,194 @@
|
||||||
|
# CLAUDE.md
|
||||||
|
|
||||||
|
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
||||||
|
|
||||||
|
## What This Repo Is
|
||||||
|
|
||||||
|
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
|
||||||
|
|
||||||
|
## Node Roles
|
||||||
|
|
||||||
|
| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
|
||||||
|
|
||||||
|
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
|
||||||
|
|
||||||
|
## Deployment
|
||||||
|
|
||||||
|
```bash
scripts/deploy/deploy.sh                         # fresh deploy on current node
scripts/deploy/deploy.sh --resume                # resume after interruption
scripts/deploy/deploy.sh --stage verify          # specific stage only
scripts/deploy/deploy.sh --service mosquitto     # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh   # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra    # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh              # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh           # CHELSTY-specific bootstrap
```
|
||||||
|
|
||||||
|
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
|
||||||
|
|
||||||
|
## Service Structure
|
||||||
|
|
||||||
|
Every service must follow this layout:
|
||||||
|
|
||||||
|
```
services/<service>/
├── docker-compose.yml
├── service.yaml          # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example           # Template — never commit actual secrets
└── healthcheck.sh        # Returns 0 (healthy) or 1 (unhealthy)
```
|
||||||
|
|
||||||
|
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
|
||||||
|
|
||||||
|
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
|
||||||
|
|
||||||
|
## Agent System Architecture
|
||||||
|
|
||||||
|
The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
|
||||||
|
|
||||||
|
1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
|
||||||
|
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
|
||||||
|
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
|
||||||
|
4. **Executor** — Executes actions only after they transition to `approved`.
|
||||||
|
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
|
||||||
|
|
||||||
|
### Action approval flow
|
||||||
|
```
Agent → /opt/homelab/actions/pending/<id>.json
      → Telegram notification → Operator approves
      → /opt/homelab/actions/approved/<id>.json
      → Executor runs → completed / failed
```
|
||||||
|
|
||||||
|
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
|
||||||
|
|
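The approval gate described above can be sketched as a guard that refuses to run anything without a file under `approved/`. This is a hedged illustration only: `is_approved` and `run_if_approved` are hypothetical names, and the real Executor may track state differently.

```python
from pathlib import Path


def is_approved(actions_root, action_id):
    """True only if <actions_root>/approved/<action_id>.json exists."""
    return (Path(actions_root) / "approved" / f"{action_id}.json").is_file()


def run_if_approved(actions_root, action_id, run):
    """Call `run()` only for approved actions; a pending file is not enough."""
    if not is_approved(actions_root, action_id):
        return "skipped"
    run()
    return "executed"
```

On a node, `actions_root` would be `/opt/homelab/actions`; taking it as a parameter keeps the sketch testable against a temporary directory.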
||||||
|
## Event System
|
||||||
|
|
||||||
|
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
|
||||||
|
|
||||||
|
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
|
||||||
|
|
||||||
|
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
|
||||||
|
|
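To make the path layout concrete, here is a minimal sketch of appending one event as a JSON line. It does not reproduce `scripts/lib/events.py`; the function name `append_event` and the record fields (`ts`, `node`, `type`, `payload`) are assumptions chosen for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(events_root, node, event_type, payload=None, now=None):
    """Append one JSON line to <events_root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    path = Path(events_root) / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Field names below are hypothetical; the real schema lives in events.py.
    record = {
        "ts": now.isoformat(),
        "node": node,
        "type": event_type,
        "payload": payload or {},
    }
    with path.open("a", encoding="utf-8") as fh:  # append-only, never rewrite
        fh.write(json.dumps(record) + "\n")
    return path
```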
||||||
|
### Supervisor event routing table
|
||||||
|
|
||||||
|
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
|
||||||
|
|
||||||
|
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
|
||||||
|
|
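The routing table and the 5-minute suppression rule can be approximated with a lookup plus one time check. This is a simplified sketch, not the supervisor's code: `route` is a hypothetical function, `cancel_restart` is an invented label for "cancels matching restart", cooldowns and the `disk_pressure` severity are ignored, and the event source is not checked, so every unlisted event falls through to `redeploy`.

```python
from datetime import datetime, timedelta

ROUTES = {
    "containers_not_running": "container_restart",
    "mqtt_unreachable": "container_restart",
    "disk_pressure": "disk_cleanup",
    "ha_websocket_dead": "container_restart",
    "ha_websocket_recovered": "cancel_restart",  # hypothetical label
    "ha_integration_failed": "alert_only",
    "ha_entity_unavailable_long": "alert_only",
    "ha_automation_failing": "alert_only",
    "ha_update_available": "alert_only",
    "ha_recorder_lag": "alert_only",
    "ha_system_health_degraded": "alert_only",
}
HA_SUPPRESSION_WINDOW = timedelta(minutes=5)


def route(event_type, now, last_ha_container_incident=None):
    """Return the action for an event, or None when an HA event is suppressed."""
    if event_type.startswith("ha_") and last_ha_container_incident is not None:
        # A recent containers_not_running incident on homeassistant suggests
        # a planned restart or update, so HA events are dropped.
        if now - last_ha_container_incident <= HA_SUPPRESSION_WINDOW:
            return None
    # "service_unhealthy / other" row: anything unlisted becomes a redeploy.
    return ROUTES.get(event_type, "redeploy")
```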
||||||
|
## Discovery Entry Points for Agents
|
||||||
|
|
||||||
|
When exploring the system, use these files in order:
|
||||||
|
1. `inventory/topology.yaml` — node list, roles, mesh type
|
||||||
|
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
|
||||||
|
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
|
||||||
|
4. `services/<service>/service.yaml` — operational contract for a service
|
||||||
|
|
||||||
|
## VPS-Specific Rules
|
||||||
|
|
||||||
|
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
|
||||||
|
|
||||||
|
### Memory limit convention
|
||||||
|
|
||||||
|
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
|
||||||
|
|
||||||
|
```yaml
services:
  myservice:
    mem_limit: 256m        # cgroup ceiling; Docker restarts on breach
    oom_score_adj: -900    # host kernel OOM-killer will not pick this container
```
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
|
||||||
|
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and will restart the container via Docker when the limit is exceeded.
|
||||||
|
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
|
||||||
|
|
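The budget rule lends itself to a quick arithmetic check. The sketch below assumes `mem_limit` values are written with Compose-style suffixes such as `256m` or `1g`; `parse_mem_limit_mib` and `within_budget` are hypothetical helpers, and nothing in the repository is known to perform this check automatically.

```python
BUDGET_MIB = 3200  # ceiling for the sum of all mem_limit values on VPS


def parse_mem_limit_mib(value):
    """Convert a mem_limit string such as '256m', '1g' or '512mb' to MiB."""
    text = value.strip().lower()
    if text.endswith("b"):  # accept '256mb' as well as '256m'
        text = text[:-1]
    factors = {"k": 1 / 1024, "m": 1.0, "g": 1024.0}
    if text and text[-1] in factors:
        return float(text[:-1]) * factors[text[-1]]
    return float(text) / (1024 * 1024)  # bare number: plain bytes


def within_budget(limits):
    """limits maps service name -> mem_limit string; True if the sum fits."""
    total = sum(parse_mem_limit_mib(v) for v in limits.values())
    return total <= BUDGET_MIB
```

For example, limits of `256m` and `1g` total 1280 MiB and pass, while `3g` plus `256m` totals 3328 MiB and exceeds the 3200 MiB ceiling.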
||||||
|
### Repo-managed services on VPS
|
||||||
|
|
||||||
|
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
|
||||||
|
|
||||||
|
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
|
||||||
|
|
||||||
|
**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
|
||||||
|
|
||||||
|
**Cutover checklist** (before running `docker compose up` for any migrated service):
|
||||||
|
1. `git pull` on VPS
|
||||||
|
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
|
||||||
|
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
|
||||||
|
4. For mosquitto: config stays at old bind path until explicitly migrated
|
||||||
|
5. Verify named volumes exist: `docker volume ls | grep <project>`
|
||||||
|
|
||||||
|
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
|
||||||
|
|
||||||
|
## CHELSTY-Specific Rules
|
||||||
|
|
||||||
|
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
|
||||||
|
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
|
||||||
|
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
|
||||||
|
|
||||||
|
## Runtime Path Conventions
|
||||||
|
|
||||||
|
`/opt/homelab/` layout on each node:
|
||||||
|
|
||||||
|
- `data/<service>/` — persistent volumes
|
||||||
|
- `config/<service>/` — secrets and host-local overrides (not in Git)
|
||||||
|
- `logs/<service>/` — service logs
|
||||||
|
- `state/` — deployment stage markers, agent heartbeats
|
||||||
|
- `events/` — append-only event store
|
||||||
|
- `world/` — Observer output (synthesized state)
|
||||||
|
- `actions/` — pending / approved / running / completed / failed
|
||||||
|
|
||||||
|
## Definition of Done (services)
|
||||||
|
|
||||||
|
Before any new or changed service is considered ready:
|
||||||
|
|
||||||
|
1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
|
||||||
|
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
|
||||||
|
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
|
||||||
|
|
||||||
|
## Naming Conventions
|
||||||
|
|
||||||
|
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
|
||||||
|
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
|
||||||
|
- Container names must match service names
|
||||||
|
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
|
||||||
|
|
||||||
|
## Multi-agent worktree mode
|
||||||
|
|
||||||
|
`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
|
||||||
|
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
|
||||||
|
|
||||||
|
If `.agent-task` exists in your current working directory, you are in a task worktree.
|
||||||
|
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
|
||||||
|
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
|
||||||
|
|
||||||
|
Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
|
||||||
|
Agents never invoke these — only the human does.
|
||||||
19
README.md
|
|
@ -13,6 +13,22 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
|
||||||
| **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
|
| **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
|
||||||
| **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
|
| **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
|
||||||
|
|
||||||
|
## Agent System
|
||||||
|
|
||||||
|
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
|
||||||
|
|
||||||
|
| Agent | Node | Role |
|-------|------|------|
| **stability-agent** | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
| **planner-agent** | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
| **executor** | VPS | Executes actions only after operator approval |
| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
|
||||||
|
|
||||||
|
Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
|
||||||
|
|
||||||
## Repository Structure
|
## Repository Structure
|
||||||
|
|
||||||
- `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
|
- `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
|
||||||
|
|
@ -29,10 +45,13 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
|
||||||
## Documentation Index
|
## Documentation Index
|
||||||
|
|
||||||
- [Infrastructure Standards](docs/standards.md)
|
- [Infrastructure Standards](docs/standards.md)
|
||||||
|
- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
|
||||||
- [Deployment Conventions](docs/deployment.md)
|
- [Deployment Conventions](docs/deployment.md)
|
||||||
- [Hardware](docs/hardware.md)
|
- [Hardware](docs/hardware.md)
|
||||||
- [Networking](docs/networking.md)
|
- [Networking](docs/networking.md)
|
||||||
- [Services](docs/services.md)
|
- [Services](docs/services.md)
|
||||||
|
- [Node Capabilities](docs/capabilities.md)
|
||||||
|
- [Action Model](services/agent-system/action-model.md)
|
||||||
|
|
||||||
---
|
---
|
||||||
*Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
|
*Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*
|
||||||
|
|
|
||||||
31
backups/zigbee/coordinator_backup.json
Normal file
|
|
@ -0,0 +1,31 @@
|
||||||
|
{
|
||||||
|
"metadata": {
|
||||||
|
"format": "zigpy/open-coordinator-backup",
|
||||||
|
"version": 1,
|
||||||
|
"source": "zigbee-herdsman@10.0.7",
|
||||||
|
"internal": {
|
||||||
|
"date": "2026-05-14T14:48:35.098Z",
|
||||||
|
"znpVersion": 1
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"stack_specific": {
|
||||||
|
"zstack": {
|
||||||
|
"tclk_seed": "32d69cbe3f0e15471e5d43f9401e485a"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"coordinator_ieee": "00124b00257bf416",
|
||||||
|
"pan_id": "46bc",
|
||||||
|
"extended_pan_id": "087730b5f614ea4a",
|
||||||
|
"nwk_update_id": 0,
|
||||||
|
"security_level": 5,
|
||||||
|
"channel": 11,
|
||||||
|
"channel_mask": [
|
||||||
|
11
|
||||||
|
],
|
||||||
|
"network_key": {
|
||||||
|
"key": "049909949a950d91522cf10cc369a724",
|
||||||
|
"sequence_number": 0,
|
||||||
|
"frame_counter": 0
|
||||||
|
},
|
||||||
|
"devices": []
|
||||||
|
}
|
||||||
49
docs/agents.md
Normal file
|
|
@ -0,0 +1,49 @@
|
||||||
|
# Agent Operating Procedures
|
||||||
|
|
||||||
|
This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
|
||||||
|
|
||||||
|
## 1. Core Principles for Agents
|
||||||
|
|
||||||
|
1. **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
|
||||||
|
2. **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
|
||||||
|
3. **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
|
||||||
|
4. **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
|
||||||
|
5. **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
|
||||||
|
|
||||||
|
## 2. Agent Roles
|
||||||
|
|
||||||
|
| Role | Responsibility | Scope |
|------|----------------|-------|
| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
|
||||||
|
|
||||||
|
## 3. Discovery Protocol
|
||||||
|
|
||||||
|
Agents must use the following entry points to understand the system:
|
||||||
|
|
||||||
|
1. **Topology**: `inventory/topology.yaml` for node list and roles.
|
||||||
|
2. **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
|
||||||
|
3. **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
|
||||||
|
4. **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
|
||||||
|
|
||||||
|
## 4. Interaction with Humans
|
||||||
|
|
||||||
|
Agents communicate with the operator via the `agent-system/telegram-bot`.
|
||||||
|
|
||||||
|
- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
|
||||||
|
- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
|
||||||
|
- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
|
||||||
|
|
||||||
|
## 5. Decision Logic (Reasoning)
|
||||||
|
|
||||||
|
When making decisions, agents MUST prioritize:
|
||||||
|
1. **Safety**: Do not violate power constraints (see `capabilities.yaml`).
|
||||||
|
2. **Stability**: Prefer keeping services on their `owner_node` unless it's down.
|
||||||
|
3. **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
|
||||||
|
|
||||||
|
## 6. Access Control for Agents
|
||||||
|
|
||||||
|
- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
|
||||||
|
- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.
|
||||||
|
|
@ -83,3 +83,10 @@ Future autonomous agents will use this metadata to:
|
||||||
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
|
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
|
||||||
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
|
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
|
||||||
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
|
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
|
||||||
|
|
||||||
|
## Agent Reasoning Logic
|
||||||
|
|
||||||
|
When an agent parses `capabilities.yaml`, it should apply these heuristics:
|
||||||
|
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
|
||||||
|
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
|
||||||
|
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
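
A minimal sketch of these heuristics, assuming `capabilities.yaml` has already been parsed into a dict. The key names come from the bullets above; the function name `placement_constraints` and the flag strings it returns are hypothetical.

```python
from typing import Any


def placement_constraints(capabilities: dict[str, Any]) -> set[str]:
    """Map parsed capabilities to a set of scheduling flags (names are illustrative)."""
    operational = capabilities.get("operational", {})
    flags: set[str] = set()
    if operational.get("connectivity") == "intermittent":
        flags.update({"no_high_bandwidth_sync", "no_cloud_dependent_services"})
    if operational.get("power_constraint") == "low-power":
        flags.update({"no_llm_inference", "no_sustained_high_cpu"})
    if capabilities.get("availability_target") == "high":
        flags.add("control_plane_failover_candidate")
    return flags
```

Returning flags rather than a yes/no answer lets a planner combine them with service requirements later.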
|
||||||
|
|
|
||||||
|
|
@ -1,60 +1,154 @@
|
||||||
# CHELSTY Runtime
|
# CHELSTY Runtime
|
||||||
|
|
||||||
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node.
|
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
|
||||||
|
|
||||||
|
| Node | Role | Services |
|
||||||
|
|------|------|----------|
|
||||||
|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
|
||||||
|
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
|
||||||
|
|
||||||
|
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
|
||||||
|
|
||||||
## Runtime Layout
|
## Runtime Layout
|
||||||
|
|
||||||
The CHELSTY runtime is located at `/opt/homelab`.
|
```
|
||||||
|
/opt/homelab/
|
||||||
- `/opt/homelab/config/`: Service-specific configurations and compose overrides.
|
├── config/ # Service-specific configs and secrets (not in Git)
|
||||||
- `/opt/homelab/data/`: Persistent data for services.
|
│ ├── mosquitto/
|
||||||
- `/opt/homelab/logs/`: Service logs.
|
│ └── zigbee2mqtt/
|
||||||
|
├── data/ # Persistent service data
|
||||||
### Key Service Locations
|
│ ├── mosquitto/ # Persistence DB, password file
|
||||||
- **Mosquitto**: `/opt/homelab/config/mosquitto/`
|
│ └── zigbee2mqtt/
|
||||||
- **Zigbee2MQTT**: `/opt/homelab/config/zigbee2mqtt/`
|
│ └── data/ # z2m config, coordinator backup, network key
|
||||||
|
└── logs/
|
||||||
|
```
|
||||||
|
|
||||||
## SLZB-06U Integration
|
## SLZB-06U Integration
|
||||||
|
|
||||||
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected via Ethernet/TCP.
|
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
|
||||||
|
|
||||||
- **Coordinator IP**: 192.168.1.105
|
- **Coordinator IP**: `192.168.1.105`
|
||||||
- **Port**: 6638
|
- **Port**: `6638`
|
||||||
- **Protocol**: TCP (ezsp adapter)
|
- **Adapter**: `ezsp` (deprecated — migration to `ember` recommended, requires only changing `adapter: ember` in `configuration.yaml`)
|
||||||
|
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
|
||||||
|
|
||||||
Zigbee2MQTT is configured to connect to this coordinator over the local network.
|
⚠️ Never use `/dev/ttyUSB0` — the coordinator is always TCP-only on this site.
|
||||||
|
|
||||||
## Offline & LTE Assumptions
|
## Networking Constraints
|
||||||
|
|
||||||
- **WAN Resilience**: All core automation (Zigbee, MQTT) runs locally on CHELSTY.
|
### Mosquitto — `network_mode: host`
|
||||||
- **Connectivity**: LTE provides intermittent uplink for remote management and Tailscale access.
|
Mosquitto runs with `network_mode: host` so that all containers on the same host can reach it at `localhost:1883`. **Do not change this.**
|
||||||
- **Home Assistant**: Runs in a separate VM, connecting to the Mosquitto broker on CHELSTY.
|
|
||||||
|
### Zigbee2MQTT — bridge network + extra_hosts
|
||||||
|
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
|
||||||
|
services:
|
||||||
|
zigbee2mqtt:
|
||||||
|
extra_hosts:
|
||||||
|
- "mosquitto:host-gateway"
|
||||||
|
```
|
||||||
|
|
||||||
|
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
|
||||||
|
|
||||||
|
**Why not `network_mode: host` for z2m?**
|
||||||
|
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
|
||||||
|
|
||||||
|
## Zigbee2MQTT Config Location
|
||||||
|
|
||||||
|
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
|
||||||
|
|
||||||
|
```
|
||||||
|
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
This path is mounted read-write by the base `docker-compose.yml`:
|
||||||
|
```yaml
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab/data/zigbee2mqtt/data:/app/data
|
||||||
|
```
|
||||||
|
|
||||||
|
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
|
||||||
|
|
||||||
|
### Minimal configuration.yaml
|
||||||
|
```yaml
|
||||||
|
homeassistant: true
|
||||||
|
permit_join: false
|
||||||
|
mqtt:
|
||||||
|
base_topic: zigbee2mqtt
|
||||||
|
server: mqtt://mosquitto:1883
|
||||||
|
serial:
|
||||||
|
port: tcp://192.168.1.105:6638
|
||||||
|
adapter: ezsp
|
||||||
|
frontend:
|
||||||
|
port: 8080
|
||||||
|
advanced:
|
||||||
|
log_level: info
|
||||||
|
```
|
||||||
|
|
||||||
|
## chelsty-ha — No node-agent
|
||||||
|
|
||||||
|
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
|
||||||
|
|
||||||
|
In `hosts/chelsty-ha/services.yaml`:
|
||||||
|
```yaml
|
||||||
|
services:
|
||||||
|
homeassistant:
|
||||||
|
monitor: false # No node-agent; suppresses supervisor action generation
|
||||||
|
```
|
||||||
|
|
||||||
|
Remove `monitor: false` once node-agent is bootstrapped on this VM.
|
||||||
|
|
||||||
## Deployment Flow
|
## Deployment Flow
|
||||||
|
|
||||||
1. **Initial Bootstrap**:
|
### Initial Bootstrap
|
||||||
Run the bootstrap script on the CHELSTY node:
|
|
||||||
```bash
|
```bash
|
||||||
./scripts/bootstrap/chelsty-runtime.sh
|
./scripts/bootstrap/chelsty-runtime.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
2. **Manual Configuration**:
|
### Deploy services
|
||||||
- Edit `/opt/homelab/config/zigbee2mqtt/.env` with MQTT credentials.
|
|
||||||
- Add Mosquitto user:
|
|
||||||
```bash
|
```bash
|
||||||
sudo mosquitto_passwd -b /opt/homelab/data/mosquitto/config/password.txt <user> <password>
|
./scripts/deploy/deploy-node.sh chelsty-infra
|
||||||
|
./scripts/deploy/deploy-node.sh chelsty-ha
|
||||||
```
|
```
|
||||||
|
|
||||||
3. **Service Deployment**:
|
### Manual (SSH) — chelsty-infra uses docker-compose v1
|
||||||
Use the staged deployment runtime:
|
|
||||||
```bash
|
```bash
|
||||||
./scripts/deploy/deploy-node.sh chelsty
|
ssh oskar@100.122.201.22
|
||||||
|
cd ~/homelab-codex-ws/services/<service>
|
||||||
|
docker-compose -f docker-compose.yml \
|
||||||
|
-f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
|
||||||
|
up -d --build --force-recreate
|
||||||
```
|
```
|
||||||
|
|
||||||
## Recovery Procedure
|
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra — always use `docker-compose` (hyphenated, v1 1.29.2).
|
||||||
|
|
||||||
In case of runtime failure:
|
## Recovery Procedures
|
||||||
1. Verify Docker and Compose plugin: `docker compose version`
|
|
||||||
2. Re-run bootstrap script to ensure directory structure and basic configs.
|
### Mosquitto stopped
|
||||||
3. Check Mosquitto logs: `tail -f /opt/homelab/data/mosquitto/log/mosquitto.log`
|
```bash
|
||||||
4. Verify SLZB-06U reachability: `ping 192.168.1.105`
|
ssh oskar@100.122.201.22 "docker start mosquitto"
|
||||||
|
# Ensure restart policy is correct:
|
||||||
|
docker update --restart unless-stopped mosquitto
|
||||||
|
```
|
||||||
|
|
||||||
|
### Zigbee2MQTT won't start
|
||||||
|
1. Check logs: `docker logs zigbee2mqtt --tail 50`
|
||||||
|
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
|
||||||
|
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
|
||||||
|
4. If config missing, recreate from the minimal template above
|
||||||
|
|
||||||
|
### SLZB-06U unreachable
|
||||||
|
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
|
||||||
|
|
||||||
|
## Critical Backup Sets
|
||||||
|
|
||||||
|
| Data | Path |
|
||||||
|
|------|------|
|
||||||
|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
|
||||||
|
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
|
||||||
|
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
|
||||||
|
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
|
||||||
|
|
||||||
|
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.
|
||||||
|
|
|
||||||
42
docs/chelsty-stability-agent.md
Normal file
|
|
@ -0,0 +1,42 @@
|
||||||
|
### CHELSTY Stability Agent
|
||||||
|
|
||||||
|
The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.
|
||||||
|
|
||||||
|
#### Purpose
|
||||||
|
|
||||||
|
It acts as a filesystem-first watchdog that detects anomalies in the local runtime environment without taking autonomous destructive actions (like restarts). It serves as the primary data source for node-level stability metrics.
|
||||||
|
|
||||||
|
#### Monitoring Scope
|
||||||
|
|
||||||
|
* **Docker Containers**: Monitors all local containers. If a container is not in the `running` state, a `containers_not_running` event is generated.
|
||||||
|
* **Disk Usage**: Monitors the root filesystem. Generates `disk_usage_high` events if usage exceeds the configured threshold.
|
||||||
|
* **Connectivity**:
|
||||||
|
* Checks if the Tailscale socket or interface is available.
|
||||||
|
* Checks reachability of the local Mosquitto MQTT broker.
|
||||||
|
* **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
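
As an illustration of the disk check, the sketch below uses the standard library's `shutil.disk_usage`. The function name and the payload shape are assumptions; only the `disk_usage_high` event type and the idea of a configurable threshold come from this document.

```python
import shutil
from typing import Optional


def disk_usage_event(path: str = "/", threshold_pct: float = 90.0) -> Optional[dict]:
    """Return a disk_usage_high event dict if usage exceeds the threshold, else None."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    if used_pct <= threshold_pct:
        return None
    # Payload fields are hypothetical.
    return {"type": "disk_usage_high",
            "payload": {"path": path, "used_pct": round(used_pct, 1)}}
```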
|
||||||
|
|
||||||
|
#### Storage and Integration
|
||||||
|
|
||||||
|
* **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
|
||||||
|
* **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
|
||||||
|
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
|
||||||
|
|
||||||
|
#### Deployment
|
||||||
|
|
||||||
|
The service is deployed via Docker Compose on CHELSTY.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd services/stability-agent
|
||||||
|
docker-compose up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Configuration
|
||||||
|
|
||||||
|
Configuration is managed via environment variables in `docker-compose.override.yml` on the host.
|
||||||
|
|
||||||
|
| Variable | Description | Default |
|
||||||
|
|----------|-------------|---------|
|
||||||
|
| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
|
||||||
|
| `DISK_THRESHOLD_PCT` | Disk usage alert threshold | `90` |
|
||||||
|
| `MQTT_HOST` | MQTT broker hostname | `mosquitto` |
|
||||||
|
| `MQTT_PORT` | MQTT broker port | `1883` |
|
||||||
96
docs/event-system.md
Normal file
|
|
@ -0,0 +1,96 @@
|
||||||
|
# Homelab Event System
|
||||||
|
|
||||||
|
The homelab multi-agent platform uses a filesystem-first event architecture for observability, auditability, and agent reasoning.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
Events are stored as individual JSON files on the local filesystem. This ensures that the system is resilient to network outages and requires no external dependencies like databases or message brokers.
|
||||||
|
|
||||||
|
### Filesystem Layout
|
||||||
|
|
||||||
|
Events are organized by date and node:
|
||||||
|
|
||||||
|
```
|
||||||
|
/opt/homelab/events/YYYY-MM-DD/node-name/TIMESTAMP_TYPE_UUID.json
|
||||||
|
```
|
||||||
|
|
||||||
|
- **Date-based partitioning** allows for easy archival and rotation.
|
||||||
|
- **Node-based partitioning** supports multi-node environments and offline synchronization.
|
||||||
|
- **Append-only** nature ensures an immutable audit trail.
|
||||||
|
|
||||||
|
## Event Schema
|
||||||
|
|
||||||
|
Each event is a JSON object with the following fields:
|
||||||
|
|
||||||
|
| Field | Type | Description |
|
||||||
|
|------------------|--------|-------------------------------------------------------|
|
||||||
|
| `timestamp` | string | ISO 8601 UTC timestamp |
|
||||||
|
| `node` | string | Hostname of the node where the event originated |
|
||||||
|
| `type` | string | Normalized event type |
|
||||||
|
| `severity` | string | `info`, `warning`, `error`, `critical` |
|
||||||
|
| `source` | string | Component that emitted the event (e.g., `deploy.sh`) |
|
||||||
|
| `service` | string | Service name or `all` |
|
||||||
|
| `correlation_id` | string | Used to link related events (e.g., deployment run ID) |
|
||||||
|
| `payload` | object | Arbitrary event-specific data |
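
To make the layout and schema concrete, here is a hedged sketch of a writer that produces one file per event. It is not the code in `scripts/lib/events.py`; the function name `write_event` and the timestamp format used in the filename are assumptions, since the layout only specifies `TIMESTAMP_TYPE_UUID.json`.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

SEVERITIES = {"info", "warning", "error", "critical"}


def write_event(events_root: Path, node: str, event_type: str, severity: str,
                source: str, service: str, correlation_id: str,
                payload: dict) -> Path:
    """Write one event as <events_root>/YYYY-MM-DD/<node>/<file>.json."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    now = datetime.now(timezone.utc)
    event = {
        "timestamp": now.isoformat(),
        "node": node,
        "type": event_type,
        "severity": severity,
        "source": source,
        "service": service,
        "correlation_id": correlation_id,
        "payload": payload,
    }
    day_dir = events_root / now.strftime("%Y-%m-%d") / node
    day_dir.mkdir(parents=True, exist_ok=True)
    # Filename timestamp format is an assumption made for this sketch.
    name = f"{now.strftime('%Y%m%dT%H%M%S')}_{event_type}_{uuid.uuid4().hex}.json"
    path = day_dir / name
    path.write_text(json.dumps(event), encoding="utf-8")
    return path
```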
|
||||||
|
|
||||||
|
### Normalized Event Types
|
||||||
|
|
||||||
|
- `deployment_started`: A deployment process has begun.
|
||||||
|
- `deployment_completed`: A deployment finished successfully.
|
||||||
|
- `deployment_failed`: A deployment failed at some stage.
|
||||||
|
- `service_unhealthy`: A healthcheck failed for a service.
|
||||||
|
- `service_recovered`: A service returned to healthy state.
|
||||||
|
- `node_offline`: Node detected it is losing connectivity (heartbeat loss).
|
||||||
|
- `node_online`: Node detected it is back online.
|
||||||
|
- `healthcheck_failed`: Generic healthcheck failure.
|
||||||
|
- `remediation_started`: An automated or manual fix is being applied.
|
||||||
|
- `remediation_completed`: Remediation finished.
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
### Shell Library
|
||||||
|
|
||||||
|
Source `scripts/lib/events.sh` to use the event library in bash scripts.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
source scripts/lib/events.sh
|
||||||
|
|
||||||
|
# Emit an event
|
||||||
|
emit_event "deployment_started" "info" "my-script.sh" "mosquitto" "unique-cid" '{"version": "1.0"}'
|
||||||
|
|
||||||
|
# List events for today
|
||||||
|
list_events
|
||||||
|
```
|
||||||
|
|
||||||
|
### Python Library
|
||||||
|
|
||||||
|
Import `scripts.lib.events` in Python scripts.
|
||||||
|
|
||||||
|
```python
|
||||||
|
from scripts.lib.events import emit_event
|
||||||
|
|
||||||
|
emit_event(
|
||||||
|
event_type="service_unhealthy",
|
||||||
|
severity="error",
|
||||||
|
source="monitor.py",
|
||||||
|
service="ollama",
|
||||||
|
correlation_id="12345",
|
||||||
|
payload={"error": "OOM"}
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Operator & AI Agent Reasoning
|
||||||
|
|
||||||
|
The event system is designed to support future AI agents:
|
||||||
|
|
||||||
|
1. **Causal Chains**: By using `correlation_id`, agents can trace a failure back to a specific deployment or remediation attempt.
|
||||||
|
2. **Resumable Remediation**: Agents can check the latest `remediation_started` events to see what has already been tried.
|
||||||
|
3. **Auditability**: Every action taken by an operator or agent leaves a permanent record on the filesystem.
|
||||||
|
4. **Offline Capability**: Events are stored locally and can be synced when connectivity is restored.
|
||||||
|
|
||||||
|
## Example Flow: Deployment Failure & Recovery
|
||||||
|
|
||||||
|
1. **Event 1**: `deployment_started` (Type: deployment, CID: `deploy-882`)
|
||||||
|
2. **Event 2**: `deployment_failed` (Type: deployment, CID: `deploy-882`, Payload: `{"stage": "verify", "error": "port 1883 not bound"}`)
|
||||||
|
3. **Event 3**: `remediation_started` (Source: `diagnostics.sh`, CID: `deploy-882`)
|
||||||
|
4. **Event 4**: `service_recovered` (Source: `healthcheck.sh`, Service: `mosquitto`, CID: `deploy-882`)
|
||||||
82
docs/node-onboarding.md
Normal file
|
|
@ -0,0 +1,82 @@
|
||||||
|
# Node Onboarding Workflow
|
||||||
|
|
||||||
|
This document describes the process of onboarding a new Linux machine into the homelab platform.
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
The onboarding process consists of three main stages:
|
||||||
|
1. **Preparation**: Setting up the runtime environment and dependencies.
|
||||||
|
2. **Discovery**: Collecting hardware and software characteristics of the node.
|
||||||
|
3. **Inventory Generation**: Creating the YAML configuration files for the node in the central inventory.
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
- A fresh Linux machine (Debian/Ubuntu recommended).
|
||||||
|
- SSH access with sudo privileges.
|
||||||
|
- Tailscale account (if using Tailscale for networking).
|
||||||
|
|
||||||
|
## Onboarding Steps
|
||||||
|
|
||||||
|
### 1. Node Preparation
|
||||||
|
|
||||||
|
Run the `prepare-node.sh` script on the target node. This script will install Docker, Tailscale, and create the `/opt/homelab` directory structure.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ./scripts/bootstrap/prepare-node.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
**Manual Step**: If you are using Tailscale, you must manually authenticate it after the script runs:
|
||||||
|
```bash
|
||||||
|
sudo tailscale up
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. Node Discovery
|
||||||
|
|
||||||
|
Run the `discover-node.sh` script to collect system information. It is recommended to redirect the output to a file.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./scripts/bootstrap/discover-node.sh > discovery-$(hostname).json
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Inventory Generation
|
||||||
|
|
||||||
|
Copy the discovery JSON file to your management machine (where the homelab repository is located) and run the inventory generator.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./scripts/bootstrap/generate-node-inventory.py discovery-node-name.json
|
||||||
|
```
|
||||||
|
|
||||||
|
This will create a new directory in `hosts/<hostname>/` with the following files:
|
||||||
|
- `host.yaml`: Basic host identity and roles.
|
||||||
|
- `capabilities.yaml`: Hardware and software capabilities.
|
||||||
|
- `paths.yaml`: Runtime path definitions.
|
||||||
|
- `networking.yaml`: Networking configuration.
|
||||||
|
|
||||||
|
### 4. Finalization
|
||||||
|
|
||||||
|
1. Review the generated YAML files in `hosts/<hostname>/`.
|
||||||
|
2. Assign appropriate roles to the node in `hosts/<hostname>/host.yaml`.
|
||||||
|
3. Commit the new host configuration to the repository.
|
||||||
|
4. Run the deployment script to apply the initial configuration:
|
||||||
|
```bash
|
||||||
|
./scripts/deploy/deploy-node.sh <hostname>
|
||||||
|
```
|
||||||
|
|
||||||
|
## Recovery Onboarding
|
||||||
|
|
||||||
|
If a node needs to be re-onboarded after a failure:
|
||||||
|
1. Run `prepare-node.sh` again. It is idempotent and will ensure the environment is correct.
|
||||||
|
2. Restore any critical data to `/opt/homelab/data/` and `/opt/homelab/backups/`.
|
||||||
|
3. Re-run `discover-node.sh` if hardware has changed, or reuse the existing inventory if it hasn't.
|
||||||
|
|
||||||
|
## Tailscale Assumptions
|
||||||
|
|
||||||
|
- Nodes are assumed to use Tailscale for management and inter-node communication.
|
||||||
|
- The `networking.yaml` will be populated with the Tailscale IP found during discovery.
|
||||||
|
- If Tailscale is not used, manual adjustment of `networking.yaml` and `host.yaml` is required.
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
- **Docker not starting**: Check `journalctl -u docker`.
|
||||||
|
- **Discovery fails**: Ensure all required tools (lscpu, lsblk, ip, etc.) are installed.
|
||||||
|
- **Inventory Generation error**: Ensure `PyYAML` is installed on the management machine.
|
||||||
98
docs/observer-runtime.md
Normal file
|
|
@ -0,0 +1,98 @@
|
||||||
|
# Observer Runtime
|
||||||
|
|
||||||
|
The Observer Runtime is a lightweight agent responsible for synthesizing the operational world state of the homelab from raw events, logs, and state files.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
|
||||||
|
|
||||||
|
### Inputs
|
||||||
|
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
|
||||||
|
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
|
||||||
|
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
|
||||||
|
|
||||||
|
### World Model Output
|
||||||
|
Generated under `/opt/homelab/world/`:
|
||||||
|
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
|
||||||
|
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
|
||||||
|
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
|
||||||
|
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
|
||||||
|
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
|
||||||
|
|
||||||
|
## Checkpoint Format
|
||||||
|
|
||||||
|
The observer tracks per-node progress to avoid silently skipping event directories:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"node_checkpoints": {
|
||||||
|
"vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
|
||||||
|
"piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
|
||||||
|
"chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
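
The difference can be sketched in a few lines. This is an illustration only, assuming event paths sort lexically in processing order within a node; `unprocessed_events` is a hypothetical name, not the observer's actual function.

```python
from pathlib import PurePosixPath


def unprocessed_events(event_files: list[str],
                       node_checkpoints: dict[str, str]) -> list[str]:
    """Return event files newer than their own node's checkpoint, in sorted order."""
    pending = []
    for path in sorted(event_files):
        # Layout: .../YYYY-MM-DD/<node>/<file>.json
        node = PurePosixPath(path).parent.name
        checkpoint = node_checkpoints.get(node)
        if checkpoint is None or path > checkpoint:
            pending.append(path)
    return pending
```

Because each file is compared only against its own node's checkpoint, a new `piha/` event is still picked up even when the `vps/` checkpoint sorts after it.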
|
||||||
|
|
||||||
|
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
|
||||||
|
|
||||||
|
## Event Types
|
||||||
|
|
||||||
|
### Negative events (create/escalate incidents)
|
||||||
|
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
|
||||||
|
- `deployment_failed` — record failure in deployments.json
|
||||||
|
|
||||||
|
### Positive events (resolve state)
|
||||||
|
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
|
||||||
|
- `service_recovered` — alias, same effect
|
||||||
|
- `deployment_completed` — marks deployment as completed
|
||||||
|
|
||||||
|
### Node events
|
||||||
|
- `node_online`, `node_offline` — update node status in nodes.json
|
||||||
|
- `disk_pressure_*` — set `disk_pressure` field on the node record
|
||||||
|
|
||||||
|
## Incident Lifecycle
|
||||||
|
|
||||||
|
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
|
||||||
|
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
|
||||||
|
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
|
||||||
|
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
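
Steps 1 to 3 can be sketched as a single reducer over events. This is a simplified model under stated assumptions: the event carries an epoch `ts` field, an open incident has `status: "active"`, and the incident id follows the `inc-<epoch>-<node>-<service>` shape seen in the example JSON in this document. The function name `apply_event` is hypothetical.

```python
FAILURE_TYPES = {"service_unhealthy", "healthcheck_failed"}
RECOVERY_TYPES = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict, event: dict) -> None:
    """Update `incidents` in place for one event."""
    node, service, ts = event["node"], event["service"], event["ts"]
    # Find the currently open incident for this node/service, if any.
    active = next(
        (i for i in incidents.values()
         if i["node"] == node and i["service"] == service
         and i["status"] == "active"),
        None,
    )
    if event["type"] in FAILURE_TYPES:
        if active is None:
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service,
                "status": "active", "severity": event.get("severity", "error"),
                "started_at": ts, "last_occurrence": ts,
                "occurrence_count": 1, "trigger_type": event["type"],
            }
        else:
            # Correlation: repeat failures collapse into the open incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in RECOVERY_TYPES and active is not None:
        active["status"] = "resolved"
        active["resolved_at"] = ts
```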
|
||||||
|
|
||||||
|
### Example Incident JSON
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"inc-1715518800-vps-observer": {
|
||||||
|
"id": "inc-1715518800-vps-observer",
|
||||||
|
"node": "vps",
|
||||||
|
"service": "observer",
|
||||||
|
"status": "resolved",
|
||||||
|
"severity": "error",
|
||||||
|
"started_at": 1715518800.0,
|
||||||
|
"last_occurrence": 1715518860.0,
|
||||||
|
"occurrence_count": 2,
|
||||||
|
"trigger_type": "containers_not_running",
|
||||||
|
"resolved_at": 1715519100.0
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
## World State Pruning
|
||||||
|
|
||||||
|
`_prune_stale_world()` runs every reconcile cycle and removes:
|
||||||
|
|
||||||
|
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
|
||||||
|
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
|
||||||
|
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
|
||||||
|
4. **Expired incidents** — resolved incidents older than 7 days.
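
A rough sketch of the four pruning rules, not the real `_prune_stale_world()`. It assumes the world files are already loaded as dicts and that the set of topology node names is passed in; `prune_world` and its parameters are hypothetical. For simplicity it drops service keys of any node missing from the topology, which covers the pruned nodes.

```python
import re

# Service-name portion of the form <12 hex chars>_<name>.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")
SEVEN_DAYS = 7 * 24 * 3600


def prune_world(nodes: dict, services: dict, incidents: dict,
                topology_nodes: set[str], now: float) -> None:
    """Remove stale nodes, their services, ghost keys, and expired incidents in place."""
    for name in [n for n in nodes if n not in topology_nodes]:
        del nodes[name]
    for key in list(services):
        node, _, service = key.partition("/")
        if node not in topology_nodes or GHOST_SERVICE.match(service):
            del services[key]
    for inc_id, inc in list(incidents.items()):
        resolved = inc.get("status") == "resolved"
        if resolved and now - inc.get("resolved_at", now) > SEVEN_DAYS:
            del incidents[inc_id]
```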
|
||||||
|
|
||||||
|
## Runtime Behavior
|
||||||
|
|
||||||
|
### Idempotency
|
||||||
|
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
|
||||||
|
|
||||||
|
### Deployment Tracking
|
||||||
|
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
|
||||||
|
|
||||||
|
### Topology Filtering
|
||||||
|
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.
|
||||||
234
docs/sessions/2026-05-27-planner-agent.md
Normal file
|
|
@ -0,0 +1,234 @@
|
||||||
|
# SESSION: Building planner-agent: LLM-based diagnostics
|
||||||
|
|
||||||
|
**DATE:** 2026-05-27
|
||||||
|
**RESULT:** planner-agent runs on SOLARIA (`healthy`) with Ollama as primary; the cloud fallback is ready to be enabled
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What was built
|
||||||
|
|
||||||
|
### `services/planner-agent/src/llm_router.py`
|
||||||
|
|
||||||
|
LLM routing module with a local-first fallback chain:
|
||||||
|
|
||||||
|
- **`LLMRouter`**: main routing class, built on litellm
|
||||||
|
- **`ModelConfig`**: configuration for a single model (name, timeout, api_base, extra_kwargs)
|
||||||
|
- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`); success_rate
|
||||||
|
- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
|
||||||
|
- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
|
||||||
|
- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with bare JSON
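
A minimal sketch of the fence-extraction idea, not the code in `llm_router.py`. The name `extract_json` and the exact regular expression are assumptions; the fence marker is built from single backticks only to keep this example easy to embed.

```python
import json
import re

TICKS = "`" * 3  # the three-backtick fence marker
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(text: str):
    """Parse `text` as JSON, falling back to the first fenced block; None if both fail."""
    candidates = [text]
    match = FENCE.search(text)
    if match:
        candidates.append(match.group(1))
    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return None
```

Trying the raw text first keeps the common case (a model that answers with bare JSON) free of any regex work.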
|
||||||
|
|
||||||
|
Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
|
||||||
|
|
||||||
|
Metrics for every call are published to the Redis channel `llm_router_metrics`.
|
||||||
|
|
||||||
|
### `services/planner-agent/src/planner.py`
|
||||||
|
|
||||||
|
Main agent loop:
|
||||||
|
|
||||||
|
- **`PlannerAgent`**: async agent: Redis subscription → LLM diagnosis → pending action file → event
|
||||||
|
- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
|
||||||
|
- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` produces the executor's format
|
||||||
|
- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT record a cooldown if the LLM failed
|
||||||
|
- **`parse_event()`**: normalizes the two input formats (node-agent / control-plane)
|
||||||
|
- **`write_pending_action()`**: atomic write: `.tmp` → rename
|
||||||
|
- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)
|
||||||
|
|
||||||
|
Pipeline:
|
||||||
|
```
|
||||||
|
Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
|
||||||
|
→ write_pending_action() → emit_event("remediation_started")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Supporting files
|
||||||
|
|
||||||
|
| File | Description |
|
||||||
|
|------|------|
|
||||||
|
| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
|
||||||
|
| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
|
||||||
|
| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
|
||||||
|
| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
|
||||||
|
| `requirements.txt` | litellm, redis, jsonschema, structlog |
|
||||||
|
| `tests/test_planner.py` | 49 unit tests |
|
||||||
|
| `tests/test_llm_router.py` | 34 unit tests |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Key architectural decisions
|
||||||
|
|
||||||
|
### 1. HITL invariant (Human-in-the-loop)
|
||||||
|
|
||||||
|
The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
|
||||||
|
The planner never executes an action on its own; this is a fundamental rule of the system.
|
||||||
|
|
||||||
|
Implementation: `write_pending_action()` writes to `pending/`; no code path touches `approved/`.
|
||||||
|
|
||||||
|
### 2. Cooldown gate
|
||||||
|
|
||||||
|
Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with repeated
|
||||||
|
proposals for the same service.
|
||||||
|
|
||||||
|
**Key decision:** the cooldown is NOT recorded if the entire LLM chain failed.
|
||||||
|
This lets the next event retry instead of being silently blocked
|
||||||
|
for 5 minutes even though no proposal was produced.
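
The cooldown rule can be sketched as below. This is an illustration under assumptions, not the class in `planner.py`: the method names, the injected `clock`, and the `handle` helper are hypothetical, and `propose` returning `None` stands in for "the whole LLM chain failed".

```python
import time
from typing import Callable, Optional


class CooldownTracker:
    """Per-key cooldown gate (simplified sketch)."""

    def __init__(self, cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last: dict[str, float] = {}

    def is_cooling_down(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is not None and self._clock() - last < self._cooldown_s

    def record(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def handle(tracker: CooldownTracker, svc_key: str,
           propose: Callable[[], Optional[dict]]) -> Optional[dict]:
    """Run `propose` unless the key is cooling down; record only on success."""
    if tracker.is_cooling_down(svc_key):
        return None
    proposal = propose()
    if proposal is not None:
        tracker.record(svc_key)  # a failed chain does not start the cooldown
    return proposal
```

Injecting the clock keeps the gate testable without sleeping.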
|
||||||
|
|
||||||
|
### 3. Fallback chain — local-first
|
||||||
|
|
||||||
|
Order: Ollama (local GPU) → Haiku → Sonnet.
|
||||||
|
|
||||||
|
Rationale:
|
||||||
|
- Ollama sends no data to external services; low latency for simple cases
|
||||||
|
- Haiku = fast, cheap cloud fallback
|
||||||
|
- Sonnet = last resort for hard cases
|
||||||
|
|
||||||
|
A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
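
The fallback loop itself can be sketched without any LLM library. In this illustration each model is a plain callable that takes a prompt and returns text; `route` and the outcome strings are hypothetical, and only two of the rejection reasons (timeout or network error, invalid JSON) are modelled.

```python
import json
from typing import Callable, Optional

Model = tuple[str, Callable[[str], str]]


def route(prompt: str, chain: list[Model]) -> tuple[Optional[dict], list[tuple[str, str]]]:
    """Try each model in order; return the first valid JSON reply plus the attempt log."""
    attempts: list[tuple[str, str]] = []
    for name, call in chain:
        try:
            raw = call(prompt)
        except (TimeoutError, ConnectionError) as exc:
            attempts.append((name, f"error: {type(exc).__name__}"))
            continue
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            attempts.append((name, "invalid_json"))
            continue
        attempts.append((name, "success"))
        return parsed, attempts
    return None, attempts
```

Keeping the attempt log alongside the result is what makes per-model metrics possible.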
|
||||||
|
|
||||||
|
### 4. No imports from control-plane
|
||||||
|
|
||||||
|
`services/planner-agent/` is fully self-contained. It imports nothing from
|
||||||
|
`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
|
||||||
|
`scripts/lib/events.py`).
|
||||||
|
|
||||||
|
Rationale: the planner must keep working even when control-plane is offline, and it has a separate deployment cycle.
|
||||||
|
|
||||||
|
### 5. structlog with PrintLoggerFactory
|
||||||
|
|
||||||
|
We do not use `structlog.stdlib.add_logger_name`, because `PrintLogger` has no `.name` attribute.
|
||||||
|
Instead, the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer` → `format_exc_info` → `JSONRenderer`.
|
||||||
|
|
||||||
|
### 6. NODE_NAME read at call time, not import time
|
||||||
|
|
||||||
|
`_emit_event_sync` reads the module-level `NODE_NAME` on every call (not as a default parameter). This makes it patchable in tests.
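The difference is plain Python semantics: default parameter values are evaluated once, when the function is defined, while a global is looked up on each call. The two function names below are illustrative, not the ones in `planner.py`.

```python
NODE_NAME = "solaria"  # module-level setting; tests may patch it


def emit_bound_at_import(node: str = NODE_NAME) -> dict:
    # The default was captured when the function was defined,
    # so later changes to NODE_NAME are not seen here.
    return {"node": node}


def emit_read_at_call() -> dict:
    # The global is looked up on every call, so a patched value takes effect.
    return {"node": NODE_NAME}
```

With the second form, `unittest.mock.patch` on the module attribute is enough to change the emitted node name in a test.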
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problems encountered and solutions
|
||||||
|
|
||||||
|
### Problem: `localhost` inside the container does not reach the host
|
||||||
|
|
||||||
|
**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container on the default bridge network cannot reach the host via `localhost`.
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
|
||||||
|
2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
|
||||||
|
|
||||||
|
### Problem: `environment` vs `env_file`, duplicated variables
|
||||||
|
|
||||||
|
**Context:** The first version of docker-compose.yml hardcoded all variables in the `environment` section with fallback values (`${VAR:-default}`). As a result, `.env` was optional rather than required.
|
||||||
|
|
||||||
|
**Solution:** All runtime variables were removed from `environment` and moved to `env_file`. Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not live in a file on disk).
|
||||||
|
|
||||||
|
### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger
|
||||||
|
|
||||||
|
**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`
|
||||||
|
|
||||||
|
**Solution:** `add_logger_name` was removed from the processor chain. It is not compatible with `PrintLoggerFactory`.
|
||||||
|
|
||||||
|
### Problem: verify stage fails right after startup
|
||||||
|
|
||||||
|
**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat does not exist.
|
||||||
|
|
||||||
|
**Cause:** Race condition: the agent needs a few seconds to start its loop and perform the first `touch()` of the heartbeat.
|
||||||
|
|
||||||
|
**Solution:** This is not a real bug. The Docker healthcheck has `start_period: 30s`. The container shows `(healthy)` about 30 s after startup.
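If the verify stage itself should tolerate this race, one option is to poll for a fresh heartbeat up to a deadline instead of checking once. This is a hypothetical sketch, not what `deploy.sh` does; the function names and the 30 s deadline are assumptions, while the 300 s maximum age mirrors `healthcheck.sh`.

```python
import time
from pathlib import Path
from typing import Optional


def heartbeat_age(path: Path) -> Optional[float]:
    """Seconds since the heartbeat was last touched, or None if it is missing."""
    try:
        return time.time() - path.stat().st_mtime
    except FileNotFoundError:
        return None


def wait_for_heartbeat(path: Path, max_age_s: float = 300.0,
                       deadline_s: float = 30.0, poll_s: float = 1.0) -> bool:
    """Poll until a fresh heartbeat exists or the deadline passes."""
    end = time.monotonic() + deadline_s
    while True:
        age = heartbeat_age(path)
        if age is not None and age <= max_age_s:
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)
```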
|
||||||
|
|
||||||
|
### Problem: git pull with divergent branches on solaria
|
||||||
|
|
||||||
|
**Symptom:** Solaria had 2 local commits not present on Forgejo, plus manual changes in the working tree. `git pull` failed with "Need to specify how to reconcile divergent branches."
|
||||||
|
|
||||||
|
**Solution:**
|
||||||
|
```bash
|
||||||
|
git checkout -- services/planner-agent/docker-compose.yml # discard manual changes
|
||||||
|
git fetch origin
|
||||||
|
git rebase origin/master # rebase local commits on top of master
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Deployment status on SOLARIA
|
||||||
|
|
||||||
|
```
|
||||||
|
Container: planner-agent Up ~30m (healthy)
|
||||||
|
Image: planner-agent-planner-agent
|
||||||
|
Node: solaria (100.100.231.104)
|
||||||
|
Heartbeat: /opt/homelab/state/planner-agent.heartbeat (age 0s)
|
||||||
|
|
||||||
|
Channels subscribed:
|
||||||
|
- health_events
|
||||||
|
- world_updates
|
||||||
|
|
||||||
|
LLM chain:
|
||||||
|
PRIMARY: ollama/qwen2.5-coder:14b @ http://host-gateway:11434
|
||||||
|
FALLBACK: claude-haiku-4-5-20251001 (disabled, no ANTHROPIC_API_KEY)
|
||||||
|
FALLBACK: claude-sonnet-4-6 (disabled, no ANTHROPIC_API_KEY)
|
||||||
|
|
||||||
|
Redis: redis://100.108.208.3:6379 ✓ connected
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Left for later
|
||||||
|
|
||||||
|
### 1. ANTHROPIC_API_KEY: cloud fallback disabled
|
||||||
|
|
||||||
|
Haiku and Sonnet are configured in the chain but have no API key. When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.
|
||||||
|
|
||||||
|
To enable:
|
||||||
|
```bash
|
||||||
|
ssh oskar@100.100.231.104
|
||||||
|
echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
|
||||||
|
docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2. End-to-end test with a real event
|
||||||
|
|
||||||
|
The planner is connected to Redis and listening, but no event has yet gone through the full path (LLM call → pending action → operator UI).
|
||||||
|
|
||||||
|
Test:
|
||||||
|
```bash
|
||||||
|
redis-cli -h 100.108.208.3 PUBLISH health_events '{
|
||||||
|
"type": "service_unhealthy",
|
||||||
|
"node": "piha",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"severity": "error",
|
||||||
|
"payload": {"reason": "container exited"},
|
||||||
|
"timestamp": "2026-05-27T20:00:00Z"
|
||||||
|
}'
|
||||||
|
# Watch: docker logs planner-agent -f
|
||||||
|
# Check: ls /opt/homelab/actions/pending/
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Solaria local commits
|
||||||
|
|
||||||
|
Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`) that are not on Forgejo. They were rebased on top of master but not pushed. They should be pushed, or reviewed and possibly squashed.
|
||||||
|
|
||||||
|
### 4. Integration with operator UI / Telegram
|
||||||
|
|
||||||
|
Proposals in `actions/pending/` do not yet have a notification channel to the operator. The Telegram bot should send a notification when a new file appears in `pending/`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Commits from this session
|
||||||
|
|
||||||
|
```
|
||||||
|
ff6fda1 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
|
||||||
|
ca37fca Add planner-agent: LLM-powered remediation planner
|
||||||
|
(llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
|
||||||
|
healthcheck.sh, Dockerfile)
|
||||||
|
```
|
||||||
103
docs/sessions/2026-05-27.md
Normal file
|
|
@ -0,0 +1,103 @@
|
||||||
|
# SESSION: Stabilization of the homelab multi-agent system
|
||||||
|
|
||||||
|
**DATE:** 2026-05-27
|
||||||
|
**RESULT:** System NOMINAL (97/97 services, 0 errors)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## PROBLEMS FOUND
|
||||||
|
|
||||||
|
- stability-agent did not generate remediation actions (only redeploy, no container_restart)
|
||||||
|
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
|
||||||
|
- zigbee2mqtt had never been deployed on chelsty-infra
|
||||||
|
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
|
||||||
|
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
|
||||||
|
- materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, and "Copy for AI" returned old data
|
||||||
|
- observer used a single global checkpoint instead of per-node ones, silently skipping event directories that sort before the current checkpoint
|
||||||
|
- supervisor did not cancel resolved actions, so the pending queue grew without bound
|
||||||
|
- the `service_healthy` event did not close active incidents
|
||||||
|
- NODE_ALIAS_MAP was not configured, so node names mismatched between events and topology
|
||||||
|
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## FIXES SHIPPED (commits in master)
|
||||||
|
|
||||||
|
```
|
||||||
|
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
|
||||||
|
b40b832 Fix ghost service keys from hash-prefixed Docker container names
|
||||||
|
28e9534 observer: service_healthy resolves active incidents
|
||||||
|
46ae92b supervisor: also cancel pending actions for services removed from desired state
|
||||||
|
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
|
||||||
|
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
|
||||||
|
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
|
||||||
|
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
|
||||||
|
fb7828b supervisor: auto-cancel pending actions when drift is resolved
|
||||||
|
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
|
||||||
|
267742c vps/node-agent: add network_mode: host for control-plane health probe
|
||||||
|
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
|
||||||
|
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
|
||||||
|
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
|
||||||
|
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
|
||||||
|
65bac4e fix(node-agent): mount host SSH key into container for event shipping
|
||||||
|
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
|
||||||
|
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
|
||||||
|
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
|
||||||
|
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
|
||||||
|
```
|
||||||
|
|
||||||
|
### Details of key fixes
|
||||||
|
|
||||||
|
**fix(observer): per-node checkpoints**
|
||||||
|
A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dictionary `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
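A sketch of why per-node checkpoints fix the skip, assuming event paths shaped like `<node>/<file>` whose file names sort chronologically (the real directory layout may differ). `select_new_events` is a hypothetical name.

```python
def select_new_events(event_files: list, checkpoint: dict) -> list:
    """Return unprocessed "<node>/<file>" paths and advance per-node checkpoints."""
    per_node = checkpoint.setdefault("node_checkpoints", {})
    fresh = []
    for path in sorted(event_files):
        node, _, name = path.partition("/")
        # Compare only against this node's own checkpoint, never a global one.
        if name > per_node.get(node, ""):
            fresh.append(path)
            per_node[node] = name
    return fresh
```

With one global checkpoint set to a `vps/` file, everything under `piha/` would compare as already processed; here each node advances independently.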
|
||||||
|
|
||||||
|
**fix(observer): ghost key pruning**
|
||||||
|
`_prune_stale_world()` now removes entries from services.json whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker's internal state tracking.
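The pattern match can be sketched as below. This is an illustration, not the body of `_prune_stale_world()`; it assumes world-state keys shaped like `node/service` and lowercase hex prefixes.

```python
import re

# Twelve lowercase hex characters, an underscore, then the real name.
GHOST_KEY = re.compile(r"^[0-9a-f]{12}_.+$")


def prune_ghost_services(services: dict) -> dict:
    """Drop entries whose service name looks like '<12 hex chars>_<name>'."""
    kept = {}
    for key, entry in services.items():
        service = key.rsplit("/", 1)[-1]  # assumes keys shaped like "node/service"
        if not GHOST_KEY.match(service):
            kept[key] = entry
    return kept
```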
|
||||||
|
|
||||||
|
**fix(node-agent): canonical container name**
|
||||||
|
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
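The naming rule, reduced to a pure function over a container's name, labels and status so it runs without the Docker SDK. `canonical_name` is a hypothetical helper, not part of `check_containers()`.

```python
import re
from typing import Optional

HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")
COMPOSE_SERVICE_LABEL = "com.docker.compose.service"


def canonical_name(name: str, labels: dict, status: str) -> Optional[str]:
    """Return the service name to report, or None if the container is skipped."""
    if status == "created":
        return None  # stale-state artifact, never reported
    label = labels.get(COMPOSE_SERVICE_LABEL)
    if label:
        return label  # preferred: the Compose service label
    return HASH_PREFIX.sub("", name)  # fallback: strip the hash prefix
```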
|
||||||
|
|
||||||
|
**fix(node-agent): service_healthy emission**
|
||||||
|
Node-agent now emits `service_healthy` for every running managed container each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
|
||||||
|
|
||||||
|
**fix(supervisor): auto-cancel resolved actions**
|
||||||
|
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
|
||||||
|
- the service became healthy (`drift_resolved_auto`)
|
||||||
|
- the service was removed from desired state (`service_removed_from_desired_state`)
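The two cancellation rules can be sketched as a filesystem move. The function name, the `node`/`service` fields inside each action file and the set-based inputs are assumptions for illustration; only the two reason strings and the `pending/` → `cancelled/` move come from the notes above.

```python
import json
from pathlib import Path


def cancel_resolved(actions_root: Path, desired: set, healthy: set) -> dict:
    """Move resolved pending actions to cancelled/; return {filename: reason}."""
    pending = actions_root / "pending"
    cancelled = actions_root / "cancelled"
    cancelled.mkdir(parents=True, exist_ok=True)
    moved = {}
    for path in sorted(pending.glob("*.json")):
        action = json.loads(path.read_text(encoding="utf-8"))
        key = f"{action['node']}/{action['service']}"  # hypothetical action fields
        if key not in desired:
            reason = "service_removed_from_desired_state"
        elif key in healthy:
            reason = "drift_resolved_auto"
        else:
            continue  # drift is still present, keep the action pending
        path.replace(cancelled / path.name)
        moved[path.name] = reason
    return moved
```

Only `pending/` is scanned, so approved or running actions are never touched.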
|
||||||
|
|
||||||
|
**fix(supervisor): monitor:false**
|
||||||
|
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. Used for `homeassistant` on chelsty-ha (no node-agent).
|
||||||
|
|
||||||
|
**fix(agent-system/materializer): control-plane API as source**
|
||||||
|
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
|
||||||
|
|
||||||
|
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
|
||||||
|
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
|
||||||
|
|
||||||
|
**fix(chelsty-infra/zigbee2mqtt): writable config**
|
||||||
|
z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## FINAL STATE
|
||||||
|
|
||||||
|
| Node | Status | Services |
|
||||||
|
|------|--------|---------|
|
||||||
|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
|
||||||
|
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
|
||||||
|
| solaria | online | node-agent, stability-agent, AI workloads |
|
||||||
|
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
|
||||||
|
| chelsty-ha | n/a | homeassistant (monitor:false, no node-agent; HA is monitored indirectly via MQTT) |
|
||||||
|
|
||||||
|
**Action queue:** 0 pending, 0 approved, 0 running
|
||||||
|
**Incidents:** 0 active
|
||||||
|
**Ghost service keys:** 0
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## KNOWN LIMITATIONS / TODO
|
||||||
|
|
||||||
|
- SLZB-06U (Zigbee coordinator) is offline: `192.168.1.105:6638` returns EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically once the coordinator returns.
|
||||||
|
- The `ezsp` adapter in the z2m config is deprecated; migration to `ember` is recommended. No new configuration is needed, only changing the field to `adapter: ember` in `configuration.yaml`.
|
||||||
|
- chelsty-ha has no node-agent. Add one when a machine is available, or bootstrap manually.
|
||||||
|
- Redis on piha still holds old keys `homelab:nodes:*`, `homelab:incidents:*` etc.; the materializer no longer uses them in API mode, so they can be cleaned up.
|
||||||
62
docs/stability-agent-rollout.md
Normal file
|
|
@ -0,0 +1,62 @@
|
||||||
|
# Stability Agent Multi-Node Rollout
|
||||||
|
|
||||||
|
## Architecture Summary
|
||||||
|
The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
|
||||||
|
|
||||||
|
- **Source**: `services/stability-agent`
|
||||||
|
- **State Path**: `/opt/homelab/state`
|
||||||
|
- **Events Path**: `/opt/homelab/events`
|
||||||
|
- **Redis Target**: `100.108.208.3:6379` (PIHA)
|
||||||
|
|
||||||
|
## Why UI only showed CHELSTY
|
||||||
|
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
|
||||||
|
|
||||||
|
## Deployment
|
||||||
|
|
||||||
|
Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Print commands
|
||||||
|
./scripts/deploy/deploy-stability-agent.sh <node-name>
|
||||||
|
|
||||||
|
# Deploy via SSH (executes ssh oskar@<ip>)
|
||||||
|
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Manual Steps per Node
|
||||||
|
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
|
||||||
|
```bash
|
||||||
|
cd /home/oskar/homelab-codex-ws
|
||||||
|
git fetch origin
|
||||||
|
git checkout master
|
||||||
|
git pull origin master
|
||||||
|
cd services/stability-agent
|
||||||
|
./deploy-local.sh <node-name>
|
||||||
|
```
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
### Fleet Overview
|
||||||
|
Run the verification script from any node with `redis-cli` access:
|
||||||
|
```bash
|
||||||
|
./scripts/deploy/verify-agent-fleet.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Redis Inspection (on PIHA)
|
||||||
|
```bash
|
||||||
|
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
|
||||||
|
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
|
||||||
|
```
|
||||||
|
|
||||||
|
Verify Web UI backend:
|
||||||
|
```bash
|
||||||
|
curl -s http://127.0.0.1:18180/nodes
|
||||||
|
curl -k https://agents.okit.pl/nodes
|
||||||
|
```
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
|
||||||
|
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
|
||||||
|
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
|
||||||
|
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.
|
||||||
|
|
@ -49,9 +49,10 @@ Runtime state must live outside the repository to keep it immutable and clean.
|
||||||
## Service Standards
|
## Service Standards
|
||||||
|
|
||||||
1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
|
1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
|
||||||
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
|
2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
|
||||||
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
|
3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
|
||||||
4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
|
4. **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
|
||||||
|
5. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.
|
||||||
|
|
||||||
## Docker Compose Standards
|
## Docker Compose Standards
|
||||||
|
|
||||||
|
|
|
||||||
126
docs/vps-control-plane.md
Normal file
|
|
@ -0,0 +1,126 @@
|
||||||
|
# VPS Control Plane
|
||||||
|
|
||||||
|
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
|
||||||
|
|
||||||
|
| Container | Role |
|
||||||
|
|-----------|------|
|
||||||
|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
|
||||||
|
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
|
||||||
|
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
|
||||||
|
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
|
||||||
|
|
||||||
|
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
|
||||||
|
|
||||||
|
## Supervisor Behavior
|
||||||
|
|
||||||
|
### Desired State
|
||||||
|
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
|
||||||
|
|
||||||
|
### Drift Types
|
||||||
|
- `missing_service` — service is in desired state but absent from `services.json`
|
||||||
|
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
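The two drift types, together with the `monitor: false` skip, can be sketched as a comparison of two dictionaries. The `node/service` key shape and the function name are assumptions; the supervisor's real data structures are not shown here.

```python
def detect_drift(desired: dict, actual: dict) -> list:
    """Return (service_key, drift_type) pairs; keys are assumed to be "node/service"."""
    drift = []
    for key, spec in sorted(desired.items()):
        if spec.get("monitor", True) is False:
            continue  # monitor: false services are skipped
        state = actual.get(key)
        if state is None:
            drift.append((key, "missing_service"))
        elif state.get("status") != "healthy":
            drift.append((key, "unhealthy_service"))
    return drift
```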
|
||||||
|
|
||||||
|
### Action Types
|
||||||
|
| Trigger | Action type | Risk |
|
||||||
|
|---------|-------------|------|
|
||||||
|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
|
||||||
|
| Any other / unknown | `redeploy` | guarded |
|
||||||
|
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
|
||||||
|
|
||||||
|
### Action ID Stability
|
||||||
|
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
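A sketch of how the trigger-to-action mapping and the deterministic ID could fit together for service-level drift. `plan_action` and the returned field names are hypothetical; the node-level `disk_cleanup` action is left out.

```python
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def plan_action(node: str, service: str, trigger: str) -> dict:
    """Choose the action type for a service drift and derive its stable ID."""
    if trigger in RESTART_TRIGGERS:
        action_type, risk, prefix = "container_restart", "low", "container-restart"
    else:
        action_type, risk, prefix = "redeploy", "guarded", "redeploy"
    return {
        "id": f"{prefix}-{node}-{service}",  # same drift always yields the same ID
        "type": action_type,
        "risk": risk,
        "node": node,
        "service": service,
    }
```

Because the ID contains no timestamp or random part, writing `<id>.json` twice overwrites the same file rather than queueing a duplicate.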
|
||||||
|
|
||||||
|
### Auto-Cancel
|
||||||
|
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
|
||||||
|
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
|
||||||
|
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
|
||||||
|
|
||||||
|
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
|
||||||
|
|
||||||
|
### Node Name Resolution
|
||||||
|
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
|
||||||
|
```
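Reading the variable might look like the sketch below. The function names are hypothetical, and falling back to an empty map on a missing or malformed value is an assumption rather than documented supervisor behavior.

```python
import json
import os


def load_alias_map(env=os.environ) -> dict:
    """Parse NODE_ALIAS_MAP (a JSON object); fall back to no aliases."""
    raw = env.get("NODE_ALIAS_MAP", "")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # assumption: a malformed value is ignored
    return parsed if isinstance(parsed, dict) else {}


def canonical_node(name: str, aliases: dict) -> str:
    """Map an event/world-state node name to its topology name."""
    return aliases.get(name, name)
```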
|
||||||
|
|
||||||
|
## Deployment
|
||||||
|
|
||||||
|
### From SATURN (primary control node)
|
||||||
|
```bash
|
||||||
|
# Full deploy via SSH
|
||||||
|
./scripts/deploy/deploy-control-plane.sh --ssh
|
||||||
|
|
||||||
|
# Or manually:
|
||||||
|
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Direct on VPS
|
||||||
|
```bash
|
||||||
|
cd ~/homelab-codex-ws/services/control-plane
|
||||||
|
docker compose up -d --build --force-recreate
|
||||||
|
```
|
||||||
|
|
||||||
|
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
|
||||||
|
|
||||||
|
### Verification
|
||||||
|
```bash
|
||||||
|
# On VPS
|
||||||
|
docker ps --filter "name=control-plane"
|
||||||
|
curl -s http://localhost:18180/summary | python3 -m json.tool
|
||||||
|
```
|
||||||
|
|
||||||
|
## Action Approval Workflow
|
||||||
|
|
||||||
|
```
|
||||||
|
Supervisor writes → /opt/homelab/actions/pending/<id>.json
|
||||||
|
→ Operator UI (port 18180) or Telegram Bot notifies
|
||||||
|
→ Operator clicks Approve
|
||||||
|
→ /opt/homelab/actions/approved/<id>.json
|
||||||
|
→ Executor executes → completed / failed
|
||||||
|
```
|
||||||
|
|
||||||
|
Possible action states: `pending → approved → running → completed / failed / rejected`
|
||||||
|
Auto-cancel path: `pending → cancelled/`
|
||||||
|
|
||||||
|
## Recovery
|
||||||
|
|
||||||
|
### World state is stale or corrupt
|
||||||
|
```bash
|
||||||
|
# On VPS — delete checkpoint to force full replay
|
||||||
|
rm /opt/homelab/state/observer_checkpoint.json
|
||||||
|
docker restart control-plane-observer
|
||||||
|
```
|
||||||
|
|
||||||
|
### Flood of pending actions after bootstrap
|
||||||
|
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check node-agent on each node
|
||||||
|
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Rebuild from scratch
|
||||||
|
```bash
|
||||||
|
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
|
||||||
|
```
|
||||||
|
|
||||||
|
## Integration
|
||||||
|
|
||||||
|
### piha agent-system webui (port 18180 on piha)
|
||||||
|
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
|
||||||
|
|
||||||
|
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
|
||||||
|
|
||||||
|
### Nginx Proxy Manager
|
||||||
|
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
|
||||||
|
|
||||||
|
### Log Locations
|
||||||
|
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
|
||||||
|
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
|
||||||
|
- World state: `/opt/homelab/world/`
|
||||||
|
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`
|
||||||
24
hosts/chelsty-ha/capabilities.yaml
Normal file
|
|
@ -0,0 +1,24 @@
|
||||||
|
host: chelsty-ha
|
||||||
|
site: chelsty
|
||||||
|
|
||||||
|
capabilities:
|
||||||
|
networking:
|
||||||
|
reachability: tailscale-only
|
||||||
|
tailscale_ip: 100.122.201.23
|
||||||
|
ingress_suitability: false
|
||||||
|
bandwidth: LTE
|
||||||
|
|
||||||
|
runtime:
|
||||||
|
container_engine: docker
|
||||||
|
os: debian
|
||||||
|
|
||||||
|
operational:
|
||||||
|
connectivity: intermittent
|
||||||
|
availability_target: best-effort
|
||||||
|
offline_first: true
|
||||||
|
uplink: lte
|
||||||
|
|
||||||
|
deployment:
|
||||||
|
suitability:
|
||||||
|
- homeassistant
|
||||||
|
restricted: false
|
||||||
20
hosts/chelsty-ha/host.yaml
Normal file
|
|
@ -0,0 +1,20 @@
|
||||||
|
hostname: chelsty-ha
|
||||||
|
site: chelsty
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- homeassistant
|
||||||
|
|
||||||
|
network:
|
||||||
|
tailscale_ip: 100.122.201.23
|
||||||
|
|
||||||
|
runtime:
|
||||||
|
root: /opt/homelab
|
||||||
|
|
||||||
|
deployment:
|
||||||
|
mode: pull
|
||||||
|
managed_by: saturn
|
||||||
|
|
||||||
|
constraints:
|
||||||
|
connectivity:
|
||||||
|
intermittent: true
|
||||||
|
uplink: lte
|
||||||
12
hosts/chelsty-ha/services.yaml
Normal file
|
|
@ -0,0 +1,12 @@
|
||||||
|
host: chelsty-ha
|
||||||
|
site: chelsty
|
||||||
|
|
||||||
|
services:
|
||||||
|
homeassistant:
|
||||||
|
role: home-automation-controller
|
||||||
|
offline_required: true
|
||||||
|
# monitor: false — chelsty-ha has no node-agent deployed, so there are no
|
||||||
|
# container-health events for the observer to track. HA is monitored
|
||||||
|
# indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
|
||||||
|
# is likely down). Re-enable once node-agent is bootstrapped on this VM.
|
||||||
|
monitor: false
|
||||||
|
|
@ -1,3 +1,6 @@
|
||||||
|
host: chelsty-infra
|
||||||
|
site: chelsty
|
||||||
|
|
||||||
capabilities:
|
capabilities:
|
||||||
hardware:
|
hardware:
|
||||||
cpu:
|
cpu:
|
||||||
|
|
@ -31,10 +34,11 @@ capabilities:
|
||||||
power_constraint: low-power
|
power_constraint: low-power
|
||||||
connectivity: intermittent
|
connectivity: intermittent
|
||||||
availability_target: best-effort
|
availability_target: best-effort
|
||||||
|
offline_operation_required: true
|
||||||
|
|
||||||
deployment:
|
deployment:
|
||||||
suitability:
|
suitability:
|
||||||
- staging
|
- staging
|
||||||
- homeassistant
|
- infra
|
||||||
- edge
|
- edge
|
||||||
restricted: false
|
restricted: false
|
||||||
|
|
@ -1,9 +1,10 @@
|
||||||
hostname: chelsty
|
hostname: chelsty-infra
|
||||||
|
site: chelsty
|
||||||
|
|
||||||
roles:
|
roles:
|
||||||
- edge
|
- edge
|
||||||
- hypervisor
|
- hypervisor
|
||||||
- homeassistant
|
- infra
|
||||||
- staging
|
- staging
|
||||||
|
|
||||||
network:
|
network:
|
||||||
|
|
@ -1,4 +1,4 @@
|
||||||
host: chelsty
|
host: chelsty-infra
|
||||||
|
|
||||||
uplink:
|
uplink:
|
||||||
type: lte
|
type: lte
|
||||||
|
|
@ -20,7 +20,7 @@ exposure_classes:
|
||||||
|
|
||||||
networks:
|
networks:
|
||||||
home_automation_lan:
|
home_automation_lan:
|
||||||
purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
|
purpose: MQTT broker, Zigbee coordinator, and local device control.
|
||||||
offline_required: true
|
offline_required: true
|
||||||
internet_required_for_core_operation: false
|
internet_required_for_core_operation: false
|
||||||
|
|
||||||
|
|
@ -1,4 +1,4 @@
|
||||||
host: chelsty
|
host: chelsty-infra
|
||||||
|
|
||||||
runtime_root: /opt/homelab
|
runtime_root: /opt/homelab
|
||||||
|
|
||||||
|
|
@ -9,12 +9,6 @@ conventions:
|
||||||
logs: /opt/homelab/logs
|
logs: /opt/homelab/logs
|
||||||
|
|
||||||
services:
|
services:
|
||||||
homeassistant:
|
|
||||||
data: /opt/homelab/data/homeassistant
|
|
||||||
config: /opt/homelab/config/homeassistant
|
|
||||||
logs: /opt/homelab/logs/homeassistant
|
|
||||||
backup_priority: critical
|
|
||||||
|
|
||||||
zigbee2mqtt:
|
zigbee2mqtt:
|
||||||
data: /opt/homelab/data/zigbee2mqtt
|
data: /opt/homelab/data/zigbee2mqtt
|
||||||
config: /opt/homelab/config/zigbee2mqtt
|
config: /opt/homelab/config/zigbee2mqtt
|
||||||
|
|
@ -27,13 +21,13 @@ services:
|
||||||
logs: /opt/homelab/logs/mosquitto
|
logs: /opt/homelab/logs/mosquitto
|
||||||
backup_priority: high
|
backup_priority: high
|
||||||
|
|
||||||
backup_sets:
|
stability-agent:
|
||||||
homeassistant:
|
data: /opt/homelab/state
|
||||||
include:
|
config: /opt/homelab/config/stability-agent
|
||||||
- /opt/homelab/config/homeassistant
|
logs: /opt/homelab/events
|
||||||
- /opt/homelab/data/homeassistant
|
backup_priority: low
|
||||||
restore_note: Restore before starting the Home Assistant container.
|
|
||||||
|
|
||||||
|
backup_sets:
|
||||||
zigbee2mqtt:
|
zigbee2mqtt:
|
||||||
include:
|
include:
|
||||||
- /opt/homelab/config/zigbee2mqtt
|
- /opt/homelab/config/zigbee2mqtt
|
||||||
88
hosts/chelsty-infra/runtime/frigate/config.yml
Normal file
|
|
@ -0,0 +1,88 @@
|
||||||
|
# Frigate NVR — chelsty-infra
|
||||||
|
# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
|
||||||
|
# Object detection: CPU (no Coral TPU)
|
||||||
|
# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
|
||||||
|
#
|
||||||
|
# Required env vars in /opt/homelab/config/frigate/frigate.env:
|
||||||
|
# CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
|
||||||
|
# CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
|
||||||
|
# MQTT_USER, MQTT_PASS (if mosquitto auth is enabled)
|
||||||
|
|
||||||
|
mqtt:
|
||||||
|
enabled: true
|
||||||
|
host: 127.0.0.1
|
||||||
|
port: 1883
|
||||||
|
# user: "{MQTT_USER}"
|
||||||
|
# password: "{MQTT_PASS}"
|
||||||
|
|
||||||
|
detectors:
|
||||||
|
cpu1:
|
||||||
|
type: cpu
|
||||||
|
num_threads: 3
|
||||||
|
|
||||||
|
ffmpeg:
|
||||||
|
hwaccel_args: preset-vaapi
|
||||||
|
global_args:
|
||||||
|
- -hide_banner
|
||||||
|
- -loglevel
|
||||||
|
- warning
|
||||||
|
|
||||||
|
record:
|
||||||
|
enabled: true
|
||||||
|
retain:
|
||||||
|
days: 7
|
||||||
|
mode: all
|
||||||
|
events:
|
||||||
|
retain:
|
||||||
|
default: 14
|
||||||
|
mode: motion
|
||||||
|
|
||||||
|
snapshots:
|
||||||
|
enabled: true
|
||||||
|
retain:
|
||||||
|
default: 7
|
||||||
|
quality: 70
|
||||||
|
|
||||||
|
objects:
|
||||||
|
track:
|
||||||
|
- person
|
||||||
|
- car
|
||||||
|
- bicycle
|
||||||
|
filters:
|
||||||
|
person:
|
||||||
|
min_area: 5000
|
||||||
|
max_area: 100000
|
||||||
|
threshold: 0.7
|
||||||
|
|
||||||
|
cameras:
|
||||||
|
camera1:
|
||||||
|
ffmpeg:
|
||||||
|
inputs:
|
||||||
|
# Main stream — high-res recording
|
||||||
|
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
|
||||||
|
roles:
|
||||||
|
- record
|
||||||
|
# Sub stream — low-res detection (lower CPU cost)
|
||||||
|
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
|
||||||
|
roles:
|
||||||
|
- detect
|
||||||
|
detect:
|
||||||
|
enabled: true
|
||||||
|
width: 640
|
||||||
|
height: 480
|
||||||
|
fps: 5
|
||||||
|
|
||||||
|
camera2:
|
||||||
|
ffmpeg:
|
||||||
|
inputs:
|
||||||
|
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
|
||||||
|
roles:
|
||||||
|
- record
|
||||||
|
- path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
|
||||||
|
roles:
|
||||||
|
- detect
|
||||||
|
detect:
|
||||||
|
enabled: true
|
||||||
|
width: 640
|
||||||
|
height: 480
|
||||||
|
fps: 5
|
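The camera paths above rely on `{NAME}` placeholders being filled from `frigate.env`. Upstream Frigate documentation describes this substitution for environment variables whose names begin with `FRIGATE_`, so it is worth checking whether the unprefixed names used here are expanded by the deployed version. A minimal sketch of a lint that lists placeholders and flags unprefixed ones; the function names and the sample text are illustrative, not part of this repository:

```python
import re

# Matches {NAME} placeholders such as {CAMERA1_USER} in config.yml values.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_placeholders(config_text):
    """Return placeholder names used in a config file, in first-seen order."""
    seen = []
    for line in config_text.splitlines():
        # Skip YAML comment lines such as the commented-out mqtt credentials.
        if line.lstrip().startswith("#"):
            continue
        for name in PLACEHOLDER.findall(line):
            if name not in seen:
                seen.append(name)
    return seen


def unprefixed(names, prefix="FRIGATE_"):
    """Names that would be left as-is if only prefixed variables are expanded."""
    return [n for n in names if not n.startswith(prefix)]


# Hypothetical excerpt mirroring the camera1 inputs in this config.
sample = """
# user: "{MQTT_USER}"
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
- path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
"""
names = find_placeholders(sample)
print("placeholders:", names)
print("not FRIGATE_-prefixed:", unprefixed(names))
```

If the lint reports names, either rename the variables in `frigate.env` and the config together, or confirm on a test instance that the stream URLs resolve.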
||||||
25
hosts/chelsty-infra/runtime/frigate/docker-compose.yml
Normal file
|
|
@ -0,0 +1,25 @@
|
||||||
|
services:
|
||||||
|
frigate:
|
||||||
|
container_name: frigate
|
||||||
|
image: ghcr.io/blakeblackshear/frigate:stable
|
||||||
|
restart: unless-stopped
|
||||||
|
privileged: true
|
||||||
|
shm_size: "256mb"
|
||||||
|
network_mode: host
|
||||||
|
devices:
|
||||||
|
- /dev/dri/renderD128:/dev/dri/renderD128
|
||||||
|
volumes:
|
||||||
|
- /etc/localtime:/etc/localtime:ro
|
||||||
|
- /opt/homelab/config/frigate/config.yml:/config/config.yml
|
||||||
|
- /opt/homelab/config/frigate:/config/credentials:ro
|
||||||
|
- /opt/homelab/data/frigate:/media/frigate
|
||||||
|
tmpfs:
|
||||||
|
- /tmp/cache
|
||||||
|
env_file:
|
||||||
|
- /opt/homelab/config/frigate/frigate.env
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 60s
|
||||||
|
|
@ -0,0 +1,11 @@
|
||||||
|
services:
|
||||||
|
node-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=chelsty-infra
|
||||||
|
- NODE_TYPE=lte_node
|
||||||
|
- VPS_EVENTS_HOST=100.95.58.48
|
||||||
|
- VPS_EVENTS_USER=oskar
|
||||||
|
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||||
|
- CHECK_INTERVAL=60
|
||||||
|
volumes:
|
||||||
|
- /home/oskar/.ssh:/root/.ssh:ro
|
||||||
|
|
@ -0,0 +1,12 @@
|
||||||
|
services:
|
||||||
|
stability-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=chelsty-infra
|
||||||
|
- SITE_NAME=chelsty
|
||||||
|
- REDIS_HOST=100.108.208.3
|
||||||
|
- REDIS_PORT=6379
|
||||||
|
- REDIS_ENABLED=true
|
||||||
|
- STABILITY_CHECK_INTERVAL=60
|
||||||
|
- DISK_THRESHOLD_PCT=85
|
||||||
|
- MQTT_HOST=mosquitto
|
||||||
|
- MQTT_PORT=1883
|
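The stability-agent's own code is not part of this diff, so how it consumes `DISK_THRESHOLD_PCT=85` is not shown. The following is a sketch of one plausible check, assuming the value is a percentage of used space on a filesystem; the function names and the fallback of 85 when the variable is unset are assumptions:

```python
import os
import shutil


def disk_used_pct(total_bytes, used_bytes):
    """Percentage of the filesystem that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def over_threshold(total_bytes, used_bytes, threshold_pct):
    """True when usage has reached or passed the configured threshold."""
    return disk_used_pct(total_bytes, used_bytes) >= threshold_pct


def check_path(path="/", env=os.environ):
    """Evaluate one mount point against DISK_THRESHOLD_PCT (assumed default: 85)."""
    threshold = float(env.get("DISK_THRESHOLD_PCT", "85"))
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return over_threshold(usage.total, usage.used, threshold)


if __name__ == "__main__":
    print("disk pressure on /:", check_path("/"))
```

Note that `used / total` can differ slightly from the percentage `df` prints, because `df` excludes blocks reserved for root.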
||||||
|
|
@ -0,0 +1,21 @@
|
||||||
|
services:
|
||||||
|
zigbee2mqtt:
|
||||||
|
# mosquitto runs with network_mode: host on chelsty-infra.
|
||||||
|
# extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
|
||||||
|
# mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
|
||||||
|
# mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
|
||||||
|
extra_hosts:
|
||||||
|
- "mosquitto:host-gateway"
|
||||||
|
environment:
|
||||||
|
- TZ=Europe/Warsaw
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 90s
|
||||||
|
# Note: volumes NOT overridden here.
|
||||||
|
# The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
|
||||||
|
# (read-write). configuration.yaml must be placed in that directory on the node:
|
||||||
|
# /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
|
||||||
|
# z2m rewrites this file during migrations — read-only mount is not viable.
|
||||||
37
hosts/chelsty-infra/services.yaml
Normal file
|
|
@ -0,0 +1,37 @@
|
||||||
|
host: chelsty-infra
|
||||||
|
site: chelsty
|
||||||
|
|
||||||
|
services:
|
||||||
|
ha-diag-agent:
|
||||||
|
role: ha-diagnostic-agent
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: false
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: [homeassistant]
|
||||||
|
config:
|
||||||
|
target_url: http://100.70.180.90:8123 # chelsty-ha via Tailscale (HAOS, separate VM)
|
||||||
|
location_tag: "chelsty"
|
||||||
|
events_dir: /opt/homelab/events/chelsty-infra
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/ha-diag-agent
|
||||||
|
data_path: /var/lib/ha-diag-agent
|
||||||
|
|
||||||
|
node-agent:
|
||||||
|
role: node-stability-monitor
|
||||||
|
# LTE node: node-agent monitors and emits events but does NO Docker cleanup.
|
||||||
|
# Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
|
||||||
|
# own retain policy is the correct remediation, not docker prune.
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
|
||||||
|
mosquitto:
|
||||||
|
role: local-mqtt-broker
|
||||||
|
|
||||||
|
zigbee2mqtt:
|
||||||
|
role: zigbee-mqtt-bridge
|
||||||
|
|
||||||
|
frigate:
|
||||||
|
role: nvr
|
||||||
|
|
@ -1,13 +0,0 @@
|
||||||
services:
|
|
||||||
zigbee2mqtt:
|
|
||||||
volumes:
|
|
||||||
- ./configuration.yaml:/app/data/configuration.yaml:ro
|
|
||||||
environment:
|
|
||||||
- MQTT_USER=${MQTT_USER}
|
|
||||||
- MQTT_PASSWORD=${MQTT_PASSWORD}
|
|
||||||
# Healthcheck is already defined in base service, but we ensure compatibility
|
|
||||||
healthcheck:
|
|
||||||
test: ["CMD", "curl", "-f", "http://localhost:8080"]
|
|
||||||
interval: 10s
|
|
||||||
timeout: 5s
|
|
||||||
retries: 3
|
|
||||||
|
|
@ -1,108 +0,0 @@
|
||||||
host: chelsty
|
|
||||||
|
|
||||||
exposure_classes:
|
|
||||||
local-only:
|
|
||||||
description: Reachable only from CHELSTY-local networks or container networks.
|
|
||||||
public_ingress: false
|
|
||||||
tailscale_required: false
|
|
||||||
tailscale-internal:
|
|
||||||
description: Reachable through the Tailscale mesh by approved tailnet clients.
|
|
||||||
public_ingress: false
|
|
||||||
tailscale_required: true
|
|
||||||
public:
|
|
||||||
description: Reachable from the public internet through an explicit ingress path.
|
|
||||||
public_ingress: true
|
|
||||||
tailscale_required: false
|
|
||||||
|
|
||||||
operational_constraints:
|
|
||||||
uplink: lte
|
|
||||||
connectivity: intermittent
|
|
||||||
offline_operation_required: true
|
|
||||||
must_not_depend_on:
|
|
||||||
- saturn
|
|
||||||
- vps
|
|
||||||
- forgejo
|
|
||||||
|
|
||||||
services:
|
|
||||||
homeassistant:
|
|
||||||
role: home-automation-controller
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: tailscale-internal
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local:
|
|
||||||
- mosquitto
|
|
||||||
- zigbee2mqtt
|
|
||||||
external: []
|
|
||||||
ports:
|
|
||||||
- name: http
|
|
||||||
container_port: 8123
|
|
||||||
protocol: tcp
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/homeassistant
|
|
||||||
data_path: /opt/homelab/data/homeassistant
|
|
||||||
logs_path: /opt/homelab/logs/homeassistant
|
|
||||||
backup:
|
|
||||||
recommended: true
|
|
||||||
include:
|
|
||||||
- /opt/homelab/config/homeassistant
|
|
||||||
- /opt/homelab/data/homeassistant
|
|
||||||
notes:
|
|
||||||
- Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
|
|
||||||
- Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.
|
|
||||||
|
|
||||||
zigbee2mqtt:
|
|
||||||
role: zigbee-mqtt-bridge
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local:
|
|
||||||
- mosquitto
|
|
||||||
external:
|
|
||||||
- slzb-06u
|
|
||||||
coordinator:
|
|
||||||
name: slzb-06u
|
|
||||||
connection: network
|
|
||||||
usb_device: null
|
|
||||||
ports:
|
|
||||||
- name: frontend
|
|
||||||
container_port: 8080
|
|
||||||
protocol: tcp
|
|
||||||
exposure: tailscale-internal
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/zigbee2mqtt
|
|
||||||
data_path: /opt/homelab/data/zigbee2mqtt
|
|
||||||
logs_path: /opt/homelab/logs/zigbee2mqtt
|
|
||||||
backup:
|
|
||||||
recommended: true
|
|
||||||
include:
|
|
||||||
- /opt/homelab/config/zigbee2mqtt
|
|
||||||
- /opt/homelab/data/zigbee2mqtt
|
|
||||||
notes:
|
|
||||||
- Include configuration.yaml, database.db, coordinator backup files, and network key material.
|
|
||||||
- Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.
|
|
||||||
|
|
||||||
mosquitto:
|
|
||||||
role: local-mqtt-broker
|
|
||||||
deployment_model: docker-compose
|
|
||||||
exposure: local-only
|
|
||||||
offline_required: true
|
|
||||||
depends_on:
|
|
||||||
local: []
|
|
||||||
external: []
|
|
||||||
ports:
|
|
||||||
- name: mqtt
|
|
||||||
container_port: 1883
|
|
||||||
protocol: tcp
|
|
||||||
runtime:
|
|
||||||
config_path: /opt/homelab/config/mosquitto
|
|
||||||
data_path: /opt/homelab/data/mosquitto
|
|
||||||
logs_path: /opt/homelab/logs/mosquitto
|
|
||||||
backup:
|
|
||||||
recommended: true
|
|
||||||
include:
|
|
||||||
- /opt/homelab/config/mosquitto
|
|
||||||
- /opt/homelab/data/mosquitto
|
|
||||||
notes:
|
|
||||||
- Retain ACL, password, persistence, and bridge configuration if enabled.
|
|
||||||
|
|
@ -0,0 +1,8 @@
|
||||||
|
services:
|
||||||
|
runtime-materializer:
|
||||||
|
environment:
|
||||||
|
# Pull world state from the VPS control-plane API instead of local Redis.
|
||||||
|
# The observer on VPS is the authoritative writer; mirroring its API output
|
||||||
|
# here ensures the webui /snapshot matches the clean 97-service state that
|
||||||
|
# the control-plane /summary endpoint serves.
|
||||||
|
CONTROL_PLANE_URL: "http://100.95.58.48:18180"
|
||||||
|
|
@ -0,0 +1,4 @@
|
||||||
|
services:
|
||||||
|
brain-watchdog:
|
||||||
|
mem_limit: 64m
|
||||||
|
restart: unless-stopped
|
||||||
11
hosts/piha/runtime/node-agent/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
services:
|
||||||
|
node-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=piha
|
||||||
|
- NODE_TYPE=sd_card
|
||||||
|
- VPS_EVENTS_HOST=100.95.58.48
|
||||||
|
- VPS_EVENTS_USER=oskar
|
||||||
|
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||||
|
- CHECK_INTERVAL=60
|
||||||
|
volumes:
|
||||||
|
- /home/oskar/.ssh:/root/.ssh:ro
|
||||||
|
|
@ -0,0 +1,7 @@
|
||||||
|
services:
|
||||||
|
stability-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=piha
|
||||||
|
- REDIS_HOST=100.108.208.3
|
||||||
|
- REDIS_PORT=6379
|
||||||
|
- REDIS_ENABLED=true
|
||||||
42
hosts/piha/services.yaml
Normal file
|
|
@ -0,0 +1,42 @@
|
||||||
|
host: piha
|
||||||
|
|
||||||
|
services:
|
||||||
|
ha-diag-agent:
|
||||||
|
role: ha-diagnostic-agent
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: false
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: [homeassistant]
|
||||||
|
config:
|
||||||
|
target_url: http://localhost:8123
|
||||||
|
location_tag: "ken"
|
||||||
|
events_dir: /opt/homelab/events/piha
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/ha-diag-agent
|
||||||
|
data_path: /var/lib/ha-diag-agent
|
||||||
|
|
||||||
|
node-agent:
|
||||||
|
role: node-stability-monitor
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: []
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/node-agent
|
||||||
|
data_path: /opt/homelab/state
|
||||||
|
logs_path: /opt/homelab/events
|
||||||
|
|
||||||
|
brain-watchdog:
|
||||||
|
role: control-plane-watchdog
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: private
|
||||||
|
offline_required: false
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: [control-plane]
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/brain-watchdog
|
||||||
11
hosts/solaria/runtime/node-agent/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,11 @@
|
||||||
|
services:
|
||||||
|
node-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=solaria
|
||||||
|
- NODE_TYPE=ai_node
|
||||||
|
- VPS_EVENTS_HOST=100.95.58.48
|
||||||
|
- VPS_EVENTS_USER=oskar
|
||||||
|
- VPS_EVENTS_PATH=/opt/homelab/events
|
||||||
|
- CHECK_INTERVAL=60
|
||||||
|
volumes:
|
||||||
|
- /home/oskar/.ssh:/root/.ssh:ro
|
||||||
|
|
@ -0,0 +1,7 @@
|
||||||
|
services:
|
||||||
|
stability-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=solaria
|
||||||
|
- REDIS_HOST=100.108.208.3
|
||||||
|
- REDIS_PORT=6379
|
||||||
|
- REDIS_ENABLED=true
|
||||||
15
hosts/solaria/services.yaml
Normal file
|
|
@ -0,0 +1,15 @@
|
||||||
|
host: solaria
|
||||||
|
|
||||||
|
services:
|
||||||
|
node-agent:
|
||||||
|
role: node-stability-monitor
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: []
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/node-agent
|
||||||
|
data_path: /opt/homelab/state
|
||||||
|
logs_path: /opt/homelab/events
|
||||||
39
hosts/vps/runtime/control-plane/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,39 @@
|
||||||
|
# Control-plane production overrides for the VPS deployment.
|
||||||
|
#
|
||||||
|
# NODE_ALIAS_MAP translates the node names that appear in raw event files
|
||||||
|
# (written by node agents / seed scripts) to the canonical names used in
|
||||||
|
# inventory/topology.yaml and hosts/*/services.yaml.
|
||||||
|
#
|
||||||
|
# Current live mapping (from /opt/homelab/events/ inspection):
|
||||||
|
# node-2 → chelsty (zigbee2mqtt / mosquitto / homeassistant node)
|
||||||
|
#
|
||||||
|
# Add further entries when new nodes come online and their event-source names
|
||||||
|
# differ from their topology names. Format is a single-line JSON object, e.g.:
|
||||||
|
# NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
|
||||||
|
#
|
||||||
|
# The executor inherits the canonical name from the action JSON written by the
|
||||||
|
# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
|
||||||
|
#
|
||||||
|
# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 makes the
|
||||||
|
# host kernel OOM-killer unlikely to target control-plane containers. mem_limit
|
||||||
|
# provides a per-container cgroup ceiling so a leaking process is restarted by
|
||||||
|
# Docker before it can exhaust host memory.
|
||||||
|
|
||||||
|
services:
|
||||||
|
operator-ui:
|
||||||
|
mem_limit: 192m
|
||||||
|
oom_score_adj: -900
|
||||||
|
|
||||||
|
observer:
|
||||||
|
mem_limit: 192m
|
||||||
|
oom_score_adj: -900
|
||||||
|
|
||||||
|
supervisor:
|
||||||
|
mem_limit: 400m
|
||||||
|
oom_score_adj: -900
|
||||||
|
environment:
|
||||||
|
- NODE_ALIAS_MAP={"node-2":"chelsty"}
|
||||||
|
|
||||||
|
executor:
|
||||||
|
mem_limit: 64m
|
||||||
|
oom_score_adj: -900
|
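The header comment describes `NODE_ALIAS_MAP` as a single-line JSON object that maps event-source node names to canonical topology names. The supervisor's parsing code is not shown in this diff; the sketch below illustrates one way such a value could be read and applied. `load_alias_map` and `canonical_node` are hypothetical names:

```python
import json
import os


def load_alias_map(env=os.environ):
    """Parse NODE_ALIAS_MAP into a dict; an unset or empty value means no aliases."""
    raw = env.get("NODE_ALIAS_MAP", "").strip()
    if not raw:
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(alias): str(canonical) for alias, canonical in parsed.items()}


def canonical_node(name, alias_map):
    """Translate an event-source name; names without an alias pass through unchanged."""
    return alias_map.get(name, name)


if __name__ == "__main__":
    aliases = load_alias_map({"NODE_ALIAS_MAP": '{"node-2":"chelsty"}'})
    for source_name in ("node-2", "piha"):
        print(source_name, "->", canonical_node(source_name, aliases))
```

Letting unmapped names pass through matches the comment's statement that entries are only needed when an event-source name differs from its topology name.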
||||||
7
hosts/vps/runtime/control-plane/env.example
Normal file
|
|
@ -0,0 +1,7 @@
|
||||||
|
# Control Plane Environment Variables
|
||||||
|
PORT=8080
|
||||||
|
HOMELAB_STATE_ROOT=/opt/homelab/state
|
||||||
|
HOMELAB_EVENTS_ROOT=/opt/homelab/events
|
||||||
|
HOMELAB_WORLD_ROOT=/opt/homelab/world
|
||||||
|
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
|
||||||
|
HOMELAB_CONFIG_ROOT=/opt/homelab/config
|
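A service reading this file would typically resolve each root once at startup. The sketch below shows that pattern under the assumption that the values in `env.example` also serve as defaults when a variable is unset; the diff does not say whether the control plane does this, and the helper names are hypothetical:

```python
import os
from pathlib import Path

# Assumed defaults, copied from env.example.
DEFAULT_ROOTS = {
    "HOMELAB_STATE_ROOT": "/opt/homelab/state",
    "HOMELAB_EVENTS_ROOT": "/opt/homelab/events",
    "HOMELAB_WORLD_ROOT": "/opt/homelab/world",
    "HOMELAB_ACTIONS_ROOT": "/opt/homelab/actions",
    "HOMELAB_CONFIG_ROOT": "/opt/homelab/config",
}


def load_roots(env=os.environ):
    """Resolve every HOMELAB_*_ROOT variable to a Path, falling back to the defaults."""
    return {name: Path(env.get(name, default)) for name, default in DEFAULT_ROOTS.items()}


def load_port(env=os.environ):
    """Read PORT as an integer (assumed default: 8080, as in env.example)."""
    return int(env.get("PORT", "8080"))


if __name__ == "__main__":
    for name, path in load_roots().items():
        print(f"{name}={path}")
    print("PORT =", load_port())
```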
||||||
16
hosts/vps/runtime/node-agent/docker-compose.override.yml
Normal file
|
|
@ -0,0 +1,16 @@
|
||||||
|
services:
|
||||||
|
node-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=vps
|
||||||
|
- CHECK_INTERVAL=60
|
||||||
|
# host network mode: node-agent on VPS shares the host's network namespace
|
||||||
|
# so that localhost:18180 resolves to the control-plane's exposed port.
|
||||||
|
# Without this, localhost inside the container is the container's own loopback
|
||||||
|
# and the _check_control_plane_health() probe would always fail.
|
||||||
|
network_mode: host
|
||||||
|
# HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
|
||||||
|
# and may accumulate Python RSS over hours; 640m cap ensures it is killed and
|
||||||
|
# auto-restarted by Docker before consuming host memory. oom_score_adj -900
|
||||||
|
# makes the host kernel OOM-killer unlikely to pick it as a global victim.
|
||||||
|
mem_limit: 640m
|
||||||
|
oom_score_adj: -900
|
||||||
|
|
@ -0,0 +1,9 @@
|
||||||
|
services:
|
||||||
|
stability-agent:
|
||||||
|
environment:
|
||||||
|
- NODE_NAME=vps
|
||||||
|
- REDIS_HOST=100.108.208.3
|
||||||
|
- REDIS_PORT=6379
|
||||||
|
- REDIS_ENABLED=true
|
||||||
|
mem_limit: 96m
|
||||||
|
oom_score_adj: -900
|
||||||
|
|
@ -1 +0,0 @@
|
||||||
npm
|
|
||||||
43
hosts/vps/services.yaml
Normal file
|
|
@ -0,0 +1,43 @@
|
||||||
|
host: vps
|
||||||
|
|
||||||
|
services:
|
||||||
|
node-agent:
|
||||||
|
role: node-stability-monitor
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: []
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/node-agent
|
||||||
|
data_path: /opt/homelab/state
|
||||||
|
logs_path: /opt/homelab/events
|
||||||
|
|
||||||
|
control-plane:
|
||||||
|
role: management-and-orchestration
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: tailscale-internal
|
||||||
|
offline_required: false
|
||||||
|
depends_on:
|
||||||
|
local:
|
||||||
|
- node-agent
|
||||||
|
external:
|
||||||
|
- piha:redis
|
||||||
|
ports:
|
||||||
|
- name: http
|
||||||
|
container_port: 18180
|
||||||
|
protocol: tcp
|
||||||
|
runtime:
|
||||||
|
config_path: /opt/homelab/config/control-plane
|
||||||
|
data_path: /opt/homelab/data/control-plane
|
||||||
|
logs_path: /opt/homelab/logs/control-plane
|
||||||
|
|
||||||
|
node_exporter:
|
||||||
|
role: metrics-exporter
|
||||||
|
deployment_model: docker-compose
|
||||||
|
exposure: local-only
|
||||||
|
offline_required: true
|
||||||
|
depends_on:
|
||||||
|
local: []
|
||||||
|
external: []
|
||||||
|
|
@ -17,6 +17,10 @@ nodes:
|
||||||
roles:
|
roles:
|
||||||
- infra
|
- infra
|
||||||
- monitoring
|
- monitoring
|
||||||
|
services:
|
||||||
|
- node-agent
|
||||||
|
- ha-diag-agent
|
||||||
|
- brain-watchdog
|
||||||
|
|
||||||
solaria:
|
solaria:
|
||||||
roles:
|
roles:
|
||||||
|
|
@ -27,12 +31,25 @@ nodes:
|
||||||
roles:
|
roles:
|
||||||
- edge
|
- edge
|
||||||
- ingress
|
- ingress
|
||||||
|
- control-plane
|
||||||
|
services:
|
||||||
|
# Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
|
||||||
|
- node-agent
|
||||||
|
- control-plane # executor, observer, supervisor, operator-ui
|
||||||
|
- node_exporter
|
||||||
|
- stability-agent
|
||||||
|
- npm # Nginx Proxy Manager — public ingress, TLS termination
|
||||||
|
- outline # Team wiki (outline + postgres + redis)
|
||||||
|
- joplin # Note sync server (joplin-server + postgres)
|
||||||
|
- ai-cluster # AI workers: codex-worker, openclaw, planner-worker,
|
||||||
|
# service-ops-worker, redis, mosquitto
|
||||||
|
|
||||||
chelsty:
|
chelsty-infra:
|
||||||
|
site: chelsty
|
||||||
roles:
|
roles:
|
||||||
- remote
|
- remote
|
||||||
- hypervisor
|
- hypervisor
|
||||||
- homeassistant
|
- infra
|
||||||
- staging
|
- staging
|
||||||
connectivity:
|
connectivity:
|
||||||
uplink: lte
|
uplink: lte
|
||||||
|
|
@ -40,10 +57,22 @@ nodes:
|
||||||
home_automation:
|
home_automation:
|
||||||
offline_operation_required: true
|
offline_operation_required: true
|
||||||
services:
|
services:
|
||||||
- homeassistant
|
|
||||||
- zigbee2mqtt
|
- zigbee2mqtt
|
||||||
- mosquitto
|
- mosquitto
|
||||||
coordinator:
|
coordinator:
|
||||||
model: SLZB-06U
|
model: SLZB-06U
|
||||||
connection: network
|
connection: network
|
||||||
usb: false
|
usb: false
|
||||||
|
|
||||||
|
chelsty-ha:
|
||||||
|
site: chelsty
|
||||||
|
roles:
|
||||||
|
- remote
|
||||||
|
- homeassistant
|
||||||
|
connectivity:
|
||||||
|
uplink: lte
|
||||||
|
intermittent: true
|
||||||
|
home_automation:
|
||||||
|
offline_operation_required: true
|
||||||
|
services:
|
||||||
|
- homeassistant
|
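The topology comment names `hosts/vps/services.yaml` as authoritative for repo-managed services, which suggests a drift check between the two files. A minimal sketch, assuming both files have already been loaded into Python structures (for example with PyYAML); `undeclared_services` is a hypothetical helper, and the sample values are copied from the lists in this diff:

```python
def undeclared_services(topology_services, host_services):
    """Services listed in the topology but absent from the host's services.yaml."""
    declared = set(host_services)
    return [name for name in topology_services if name not in declared]


# Service list from the topology hunk that references hosts/vps/services.yaml.
topology_vps = [
    "node-agent", "control-plane", "node_exporter", "stability-agent",
    "npm", "outline", "joplin", "ai-cluster",
]
# Top-level keys under `services:` in hosts/vps/services.yaml.
services_vps = {"node-agent": {}, "control-plane": {}, "node_exporter": {}}

print(undeclared_services(topology_vps, services_vps))
```

Whether the reported names are genuine drift or services that are deliberately managed outside the repository is something only the maintainers can decide.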
||||||
|
|
|
||||||
130
scripts/bootstrap/discover-node.sh
Executable file
|
|
@ -0,0 +1,130 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# scripts/bootstrap/discover-node.sh
|
||||||
|
# Node discovery script for the homelab platform.
|
||||||
|
# Collects system information and outputs it in JSON format.
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# Help function
|
||||||
|
show_help() {
|
||||||
|
echo "Usage: $0 [options]"
|
||||||
|
echo "Options:"
|
||||||
|
echo " --json Output in JSON format (default)"
|
||||||
|
echo " --yaml Output in YAML format"
|
||||||
|
echo " --help Show this help"
|
||||||
|
}
|
||||||
|
|
||||||
|
OUTPUT_FORMAT="json"
|
||||||
|
|
||||||
|
while [[ "$#" -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--json) OUTPUT_FORMAT="json"; shift ;;
|
||||||
|
--yaml) OUTPUT_FORMAT="yaml"; shift ;;
|
||||||
|
--help) show_help; exit 0 ;;
|
||||||
|
*) echo "Unknown parameter: $1"; show_help; exit 1 ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
# Check dependencies
|
||||||
|
for cmd in hostname lscpu free lsblk ip curl; do
|
||||||
|
if ! command -v "$cmd" &> /dev/null; then
|
||||||
|
echo "Error: Required command '$cmd' not found." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
# Collect Data
|
||||||
|
HOSTNAME=$(hostname)
|
||||||
|
OS_DISTRO=$(grep PRETTY_NAME /etc/os-release | cut -d'"' -f2)
|
||||||
|
ARCH=$(uname -m)
|
||||||
|
CPU_MODEL=$(lscpu | grep "Model name:" | sed 's/Model name:[[:space:]]*//')
|
||||||
|
CPU_CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l) # physical cores
|
||||||
|
CPU_THREADS=$(lscpu | grep "^CPU(s):" | awk '{print $2}') # logical CPUs
|
||||||
|
RAM_TOTAL_GB=$(free -g | grep "Mem:" | awk '{print $2}')
|
||||||
|
|
||||||
|
# Disks
|
||||||
|
DISKS=$(lsblk -dno NAME,SIZE,TYPE,MODEL | grep disk | awk '{printf "{\"name\": \"%s\", \"size\": \"%s\", \"model\": \"%s\"},", $1, $2, $4}' | sed 's/,$//')
|
||||||
|
|
||||||
|
# GPU Presence
|
||||||
|
GPU_PRESENT=false
|
||||||
|
if lspci | grep -i 'vga\|3d\|display' | grep -i 'nvidia\|amd\|intel' > /dev/null; then
|
||||||
|
GPU_PRESENT=true
|
||||||
|
GPU_INFO=$(lspci | grep -i 'vga\|3d\|display' | head -n 1 | cut -d ':' -f3 | sed 's/^[[:space:]]*//')
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Virtualization
|
||||||
|
VIRT_SUPPORTED=false
|
||||||
|
if lscpu | grep "Virtualization:" > /dev/null; then
|
||||||
|
VIRT_SUPPORTED=true
|
||||||
|
VIRT_TYPE=$(lscpu | grep "Virtualization:" | awk '{print $2}')
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Network Interfaces
|
||||||
|
INTERFACES=$(ip -j addr show | jq -c '[.[] | {name: .ifname, active: (if .operstate == "UP" then true else false end), ips: [.addr_info[].local]}]' 2>/dev/null || echo "[$(ip -o link show | awk -F': ' '{printf "%s\"%s\"", sep, $2; sep=", "}')]") # fallback without jq: JSON array of interface names
|
||||||
|
|
||||||
|
# Tailscale
|
||||||
|
TAILSCALE_STATUS="not-installed"
|
||||||
|
TAILSCALE_IP="null"
|
||||||
|
if command -v tailscale &> /dev/null; then
|
||||||
|
if tailscale status &> /dev/null; then
|
||||||
|
TAILSCALE_STATUS="active"
|
||||||
|
TAILSCALE_IP=$(tailscale ip -4)
|
||||||
|
else
|
||||||
|
TAILSCALE_STATUS="installed-inactive"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Docker
|
||||||
|
DOCKER_AVAILABLE=false
|
||||||
|
if command -v docker &> /dev/null; then
|
||||||
|
if docker info &> /dev/null; then
|
||||||
|
DOCKER_AVAILABLE=true
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Connectivity
|
||||||
|
CONNECTIVITY="unknown"
|
||||||
|
if curl -s --head --max-time 5 https://google.com &> /dev/null; then
|
||||||
|
CONNECTIVITY="internet-access"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Output Construction (JSON)
|
||||||
|
cat <<EOF
|
||||||
|
{
|
||||||
|
"hostname": "$HOSTNAME",
|
||||||
|
"os": {
|
||||||
|
"distro": "$OS_DISTRO",
|
||||||
|
"arch": "$ARCH"
|
||||||
|
},
|
||||||
|
"hardware": {
|
||||||
|
"cpu": {
|
||||||
|
"model": "$CPU_MODEL",
|
||||||
|
"cores": $CPU_CORES,
|
||||||
|
"threads": $(lscpu | grep "^CPU(s):" | awk '{print $2}')
|
||||||
|
},
|
||||||
|
"memory": {
|
||||||
|
"total_gb": $RAM_TOTAL_GB
|
||||||
|
},
|
||||||
|
"gpu": {
|
||||||
|
"present": $GPU_PRESENT,
|
||||||
|
"info": "${GPU_INFO:-none}"
|
||||||
|
},
|
||||||
|
"disks": [$DISKS]
|
||||||
|
},
|
||||||
|
"virtualization": {
|
||||||
|
"supported": $VIRT_SUPPORTED,
|
||||||
|
"type": "${VIRT_TYPE:-none}"
|
||||||
|
},
|
||||||
|
"network": {
|
||||||
|
"interfaces": $INTERFACES,
|
||||||
|
"tailscale": {
|
||||||
|
"status": "$TAILSCALE_STATUS",
|
||||||
|
"ip": "$TAILSCALE_IP"
|
||||||
|
},
|
||||||
|
"connectivity": "$CONNECTIVITY"
|
||||||
|
},
|
||||||
|
"docker": {
|
||||||
|
"available": $DOCKER_AVAILABLE
|
||||||
|
}
|
||||||
|
}
|
||||||
|
EOF
|
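The JSON printed by this script is consumed by `generate-node-inventory.py`, which indexes several nested keys directly and would raise `KeyError` if one were missing. The sketch below checks a discovery document for those keys before it is used. `REQUIRED_PATHS` is derived from the lookups in the generator, and `missing_paths` is a hypothetical helper rather than an existing part of the repository:

```python
import json
import sys

# Key paths that generate-node-inventory.py reads with direct indexing.
REQUIRED_PATHS = [
    ("os", "arch"),
    ("os", "distro"),
    ("hardware", "cpu", "cores"),
    ("hardware", "cpu", "threads"),
    ("hardware", "memory", "total_gb"),
    ("hardware", "gpu", "present"),
    ("hardware", "disks"),
    ("virtualization", "supported"),
    ("virtualization", "type"),
    ("network", "tailscale", "status"),
    ("network", "tailscale", "ip"),
    ("docker", "available"),
]


def missing_paths(data, required=REQUIRED_PATHS):
    """Return dotted key paths that are absent from a discovery document."""
    missing = []
    for path in required:
        node = data
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


if __name__ == "__main__":
    # Usage sketch: ./discover-node.sh | python3 this_file.py
    problems = missing_paths(json.load(sys.stdin))
    if problems:
        sys.exit("discovery output is missing: " + ", ".join(problems))
    print("discovery output has every key the inventory generator reads")
```

Because `json.load` runs first, the same check also catches output that is not valid JSON, which can happen when an interpolated value contains a double quote.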
||||||
113
scripts/bootstrap/generate-node-inventory.py
Executable file
|
|
@ -0,0 +1,113 @@
|
||||||
|
#!/usr/bin/env python3
|
||||||
|
import json
|
||||||
|
import sys
|
||||||
|
import os
|
||||||
|
import yaml
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
def generate_inventory(discovery_data):
|
||||||
|
hostname = discovery_data.get("hostname", "unknown-node")
|
||||||
|
host_dir = Path(f"hosts/{hostname}")
|
||||||
|
host_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# 1. host.yaml
|
||||||
|
host_yaml = {
|
||||||
|
"hostname": hostname,
|
||||||
|
"roles": ["unassigned"],
|
||||||
|
"network": {
|
||||||
|
"tailscale_ip": discovery_data["network"]["tailscale"]["ip"]
|
||||||
|
},
|
||||||
|
"runtime": {
|
||||||
|
"root": "/opt/homelab"
|
||||||
|
},
|
||||||
|
"deployment": {
|
||||||
|
"mode": "pull",
|
||||||
|
"managed_by": "saturn"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
with open(host_dir / "host.yaml", "w") as f:
|
||||||
|
yaml.dump(host_yaml, f, sort_keys=False)
|
||||||
|
|
||||||
|
# 2. capabilities.yaml
|
||||||
|
capabilities_yaml = {
|
||||||
|
"capabilities": {
|
||||||
|
"hardware": {
|
||||||
|
"cpu": {
|
||||||
|
"arch": discovery_data["os"]["arch"],
|
||||||
|
"cores": discovery_data["hardware"]["cpu"]["cores"],
|
||||||
|
"threads": discovery_data["hardware"]["cpu"]["threads"]
|
||||||
|
},
|
||||||
|
"memory": {
|
||||||
|
"total_gb": discovery_data["hardware"]["memory"]["total_gb"]
|
||||||
|
},
|
||||||
|
"acceleration": {
|
||||||
|
"type": "gpu" if discovery_data["hardware"]["gpu"]["present"] else "none"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"virtualization": {
|
||||||
|
"supported": discovery_data["virtualization"]["supported"],
|
||||||
|
"type": discovery_data["virtualization"]["type"]
|
||||||
|
},
|
||||||
|
"storage": {
|
||||||
|
"persistence": "persistent",
|
||||||
|
"type": "ssd", # Default assumption
|
||||||
|
"capacity_gb": sum([float(d["size"].rstrip("G")) for d in discovery_data["hardware"]["disks"] if "G" in d["size"]]) # Very rough estimate
|
||||||
|
},
|
||||||
|
"networking": {
|
||||||
|
"reachability": "tailscale-only" if discovery_data["network"]["tailscale"]["status"] == "active" else "direct",
|
||||||
|
"ingress_suitability": False,
|
||||||
|
"bandwidth": "unknown"
|
||||||
|
},
|
||||||
|
"runtime": {
|
||||||
|
"container_engine": "docker" if discovery_data["docker"]["available"] else "none",
|
||||||
|
"os": discovery_data["os"]["distro"]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
with open(host_dir / "capabilities.yaml", "w") as f:
|
||||||
|
yaml.dump(capabilities_yaml, f, sort_keys=False)
|
||||||
|
|
||||||
|
# 3. paths.yaml
|
||||||
|
paths_yaml = {
|
||||||
|
"host": hostname,
|
||||||
|
"runtime_root": "/opt/homelab",
|
||||||
|
"conventions": {
|
||||||
|
"services": "/opt/homelab/services",
|
||||||
|
"data": "/opt/homelab/data",
|
||||||
|
"config": "/opt/homelab/config",
|
||||||
|
"logs": "/opt/homelab/logs"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
with open(host_dir / "paths.yaml", "w") as f:
|
||||||
|
yaml.dump(paths_yaml, f, sort_keys=False)
|
||||||
|
|
||||||
|
# 4. networking.yaml
|
||||||
|
networking_yaml = {
|
||||||
|
"host": hostname,
|
||||||
|
"uplink": {
|
||||||
|
"type": "unknown",
|
||||||
|
"connectivity": "unknown"
|
||||||
|
},
|
||||||
|
"tailscale": {
|
||||||
|
"enabled": discovery_data["network"]["tailscale"]["status"] == "active",
|
||||||
|
"host_ip": discovery_data["network"]["tailscale"]["ip"],
|
||||||
|
"role": "internal-management"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
with open(host_dir / "networking.yaml", "w") as f:
|
||||||
|
yaml.dump(networking_yaml, f, sort_keys=False)
|
||||||
|
|
||||||
|
print(f"Inventory generated for {hostname} in {host_dir}")
|
||||||
|
|
||||||
|
def main():
|
||||||
|
if len(sys.argv) > 1:
|
||||||
|
with open(sys.argv[1], "r") as f:
|
||||||
|
data = json.load(f)
|
||||||
|
else:
|
||||||
|
# Read from stdin
|
||||||
|
data = json.load(sys.stdin)
|
||||||
|
|
||||||
|
generate_inventory(data)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
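The `capacity_gb` expression in `capabilities.yaml` only counts disks whose size string contains `G`, so a disk that `lsblk` reports as `1.8T` contributes nothing. A sketch of a unit-aware conversion that could replace it, assuming `lsblk`'s human-readable suffixes are powers of 1024; `size_to_gib` and `total_capacity_gib` are hypothetical names:

```python
# Multipliers that convert a suffixed lsblk size into GiB.
UNIT_TO_GIB = {
    "B": 1 / 1024**3,
    "K": 1 / 1024**2,
    "M": 1 / 1024,
    "G": 1.0,
    "T": 1024.0,
    "P": 1024.0**2,
}


def size_to_gib(size):
    """Convert strings such as '931.5G', '1.8T' or '512M' to GiB."""
    text = size.strip().replace(",", ".")  # some locales print a decimal comma
    if not text:
        raise ValueError("empty size string")
    unit = text[-1].upper()
    if unit in UNIT_TO_GIB:
        return float(text[:-1]) * UNIT_TO_GIB[unit]
    # No suffix: treat the value as a raw byte count (as printed by lsblk -b).
    return float(text) / 1024**3


def total_capacity_gib(disks):
    """Sum the sizes of the disk entries produced by discover-node.sh."""
    return sum(size_to_gib(disk["size"]) for disk in disks)


if __name__ == "__main__":
    example = [{"name": "sda", "size": "931.5G"}, {"name": "sdb", "size": "1.8T"}]
    print(round(total_capacity_gib(example), 1), "GiB")
```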
||||||
121
scripts/bootstrap/prepare-node.sh
Executable file
|
|
@ -0,0 +1,121 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# scripts/bootstrap/prepare-node.sh
|
||||||
|
# Real node preparation script for the homelab platform.
|
||||||
|
# Responsibilities:
|
||||||
|
# - validate Linux environment
|
||||||
|
# - create runtime directories
|
||||||
|
# - install/check dependencies (git, docker, tailscale)
|
||||||
|
# - create homelab runtime layout
|
||||||
|
# - validate Docker daemon
|
||||||
|
# - validate network access
|
||||||
|
# - support idempotent re-runs
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
RUNTIME_ROOT="/opt/homelab"
|
||||||
|
DIRECTORIES=("config" "data" "logs" "state" "backups")
|
||||||
|
LOG_FILE="/tmp/homelab-prepare-node.log"
|
||||||
|
|
||||||
|
# Colors for output
|
||||||
|
RED='\033[0;31m'
|
||||||
|
GREEN='\033[0;32m'
|
||||||
|
YELLOW='\033[1;33m'
|
||||||
|
NC='\033[0m' # No Color
|
||||||
|
|
||||||
|
log() {
|
||||||
|
echo -e "${GREEN}[INFO]${NC} $1" | tee -a "$LOG_FILE"
|
||||||
|
}
|
||||||
|
|
||||||
|
warn() {
|
||||||
|
echo -e "${YELLOW}[WARN]${NC} $1" | tee -a "$LOG_FILE"
|
||||||
|
}
|
||||||
|
|
||||||
|
error() {
|
||||||
|
echo -e "${RED}[ERROR]${NC} $1" | tee -a "$LOG_FILE" >&2
|
||||||
|
exit 1
|
||||||
|
}
|
||||||
|
|
||||||
|
log "Starting homelab node preparation..."
|
||||||
|
|
||||||
|
# 1. Validate Linux environment
|
||||||
|
if [[ "$OSTYPE" != "linux-gnu"* ]]; then
|
||||||
|
error "This script only supports Linux."
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $EUID -ne 0 ]]; then
|
||||||
|
error "This script must be run as root (use sudo)."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 2. Create runtime directories
|
||||||
|
log "Creating runtime directories in $RUNTIME_ROOT..."
|
||||||
|
mkdir -p "$RUNTIME_ROOT"
|
||||||
|
for dir in "${DIRECTORIES[@]}"; do
|
||||||
|
mkdir -p "$RUNTIME_ROOT/$dir"
|
||||||
|
done
|
||||||
|
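chmod 755 "$RUNTIME_ROOT" "${DIRECTORIES[@]/#/$RUNTIME_ROOT/}"  # top-level dirs only, so modes of existing files (e.g. secrets) survive re-runs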
|
||||||
|
|
||||||
|
# 3. Install/check dependencies
|
||||||
|
install_apt_deps() {
|
||||||
|
log "Updating apt and installing dependencies..."
|
||||||
|
apt-get update -y
|
||||||
|
apt-get install -y git curl apt-transport-https ca-certificates gnupg lsb-release
|
||||||
|
}
|
||||||
|
|
||||||
|
# Docker installation
|
||||||
|
if ! command -v docker &> /dev/null; then
|
||||||
|
log "Installing Docker..."
|
||||||
|
install_apt_deps
|
||||||
|
curl -fsSL https://get.docker.com -o get-docker.sh
|
||||||
|
sh get-docker.sh
|
||||||
|
rm get-docker.sh
|
||||||
|
else
|
||||||
|
log "Docker is already installed."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Docker Compose Plugin
|
||||||
|
if ! docker compose version &> /dev/null; then
|
||||||
|
log "Installing Docker Compose plugin..."
|
||||||
|
apt-get update -y
|
||||||
|
apt-get install -y docker-compose-plugin
|
||||||
|
else
|
||||||
|
log "Docker Compose plugin is already installed."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Tailscale installation
|
||||||
|
if ! command -v tailscale &> /dev/null; then
|
||||||
|
log "Installing Tailscale..."
|
||||||
|
curl -fsSL https://tailscale.com/install.sh | sh
|
||||||
|
else
|
||||||
|
log "Tailscale is already installed."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 4. Validate Docker daemon
|
||||||
|
log "Validating Docker daemon..."
|
||||||
|
if ! systemctl is-active --quiet docker; then
|
||||||
|
log "Starting Docker service..."
|
||||||
|
systemctl enable --now docker
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker info &> /dev/null; then
|
||||||
|
error "Docker daemon is not responding correctly."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 5. Validate network access
|
||||||
|
log "Validating network access..."
|
||||||
|
if ! curl -fs --max-time 10 -o /dev/null https://google.com; then
|
||||||
|
warn "External network access might be limited."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 6. Prepare SSH access assumptions
|
||||||
|
log "Checking SSH access assumptions..."
|
||||||
|
if [[ ! -d "$HOME/.ssh" ]]; then
|
||||||
|
mkdir -p "$HOME/.ssh"
|
||||||
|
chmod 700 "$HOME/.ssh"
|
||||||
|
fi
|
||||||
|
# We assume the user has already set up their keys or will do so.
|
||||||
|
# We just ensure the directory exists with correct permissions.
|
||||||
|
|
||||||
|
log "Node preparation completed successfully!"
|
||||||
|
log "Runtime layout at $RUNTIME_ROOT is ready."
|
||||||
|
log "Next step: Run scripts/bootstrap/discover-node.sh to generate discovery data."
|
||||||
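Note on network validation (step 5): a probe that decides by looking at the HTTP status line has to accept more than one spelling. HTTP/1.1 servers answer `HTTP/1.1 200 OK`, curl prints HTTP/2 responses as `HTTP/2 200` with no reason phrase, and a redirect such as 301 still shows that the network path works. The sketch below classifies a status line by its numeric code. `status_ok` is a hypothetical helper for illustration and is not part of prepare-node.sh.

```shell
# Hypothetical helper (not in prepare-node.sh): succeed for any 2xx or 3xx status line.
status_ok() {
    # $1 is a status line such as "HTTP/1.1 200 OK" or "HTTP/2 301".
    # Strip the carriage return that HTTP header lines end with, then take field 2.
    code=$(printf '%s\n' "$1" | tr -d '\r' | awk '{print $2}')
    case "$code" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}
```

It could be called as `status_ok "$(curl -sI --max-time 10 https://google.com | head -n 1)" || warn "External network access might be limited."`, which keeps the check a warning rather than a hard failure.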
75
scripts/bootstrap/vps-control-plane.sh
Executable file
|
|
@@ -0,0 +1,75 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# vps-control-plane.sh - Bootstrap script for VPS control plane
|
||||||
|
|
||||||
|
set -e
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
|
||||||
|
RUNTIME_DIR="/opt/homelab"
|
||||||
|
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
|
||||||
|
|
||||||
|
# Colors for output
|
||||||
|
RED='\033[0;31m'
|
||||||
|
GREEN='\033[0;32m'
|
||||||
|
YELLOW='\033[1;33m'
|
||||||
|
NC='\033[0m' # No Color
|
||||||
|
|
||||||
|
log() { echo -e "${GREEN}[INFO]${NC} $1"; }
|
||||||
|
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
|
||||||
|
error() { echo -e "${RED}[ERROR]${NC} $1" >&2; exit 1; }
|
||||||
|
|
||||||
|
log "Starting VPS control plane bootstrap..."
|
||||||
|
|
||||||
|
# 1. Validate Docker availability
|
||||||
|
if ! command -v docker &> /dev/null; then
|
||||||
|
error "Docker is not installed. Please install Docker first."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 2. Validate compose plugin
|
||||||
|
if ! docker compose version &> /dev/null; then
|
||||||
|
error "Docker Compose plugin is not installed."
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Docker and Compose plugin verified."
|
||||||
|
|
||||||
|
# 3. Create filesystem-first runtime structure
|
||||||
|
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
|
||||||
|
sudo mkdir -p "$RUNTIME_DIR/events" \
|
||||||
|
"$RUNTIME_DIR/state" \
|
||||||
|
"$RUNTIME_DIR/world" \
|
||||||
|
"$RUNTIME_DIR/actions/pending" \
|
||||||
|
"$RUNTIME_DIR/actions/approved" \
|
||||||
|
"$RUNTIME_DIR/actions/running" \
|
||||||
|
"$RUNTIME_DIR/actions/completed" \
|
||||||
|
"$RUNTIME_DIR/actions/failed" \
|
||||||
|
"$RUNTIME_DIR/actions/rejected" \
|
||||||
|
"$RUNTIME_DIR/config" \
|
||||||
|
"$RUNTIME_DIR/logs"
|
||||||
|
|
||||||
|
# 4. Set permissions
|
||||||
|
log "Setting permissions..."
|
||||||
|
sudo chown -R "$(id -un):$(id -gn)" "$RUNTIME_DIR"
|
||||||
|
chmod -R 755 "$RUNTIME_DIR"
|
||||||
|
|
||||||
|
# 5. Install environment file
|
||||||
|
log "Installing environment configuration..."
|
||||||
|
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
|
||||||
|
cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
|
||||||
|
log "Created $RUNTIME_DIR/config/control-plane.env from template."
|
||||||
|
else
|
||||||
|
warn "Environment file already exists, skipping installation."
|
||||||
|
fi
|
||||||
|
|
||||||
|
# 6. Build and start the control plane
|
||||||
|
log "Building and starting control plane services..."
|
||||||
|
cd "$REPO_ROOT/services/control-plane"
|
||||||
|
docker compose build
|
||||||
|
docker compose up -d
|
||||||
|
|
||||||
|
log "VPS control plane bootstrap complete!"
|
||||||
|
|
||||||
|
echo -e "\n${YELLOW}Verification commands:${NC}"
|
||||||
|
echo "1. Check container status: docker compose ps"
|
||||||
|
echo "2. Check operator UI: curl http://localhost:8080/summary"
|
||||||
|
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
|
||||||
|
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
|
||||||
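Note on the `actions/` layout created in step 3: the directory names (pending, approved, running, completed, failed, rejected) suggest that an action moves between directories as its state changes. How the control plane actually does this is not shown in this diff, so the following is only a sketch under that assumption. `transition_action` is a hypothetical helper, and it assumes each action is a single file directly under `actions/<state>/`.

```shell
# Hypothetical helper (not in vps-control-plane.sh): move one action file
# from one lifecycle directory to another.
transition_action() {
    # $1 = runtime root, $2 = action file name, $3 = current state, $4 = next state
    src="$1/actions/$3/$2"
    dst_dir="$1/actions/$4"
    [ -f "$src" ] || { echo "no such action: $src" >&2; return 1; }
    [ -d "$dst_dir" ] || { echo "unknown state: $4" >&2; return 1; }
    # Within one filesystem mv is a rename, so a reader never sees a half-written file.
    mv "$src" "$dst_dir/$2"
}
```

For example, `transition_action /opt/homelab a1.json pending approved` would move `a1.json` from `actions/pending` to `actions/approved`. The rename guarantee only holds while all state directories live on the same filesystem, which is the case for the layout created above.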
23
scripts/deploy/deploy-control-plane.sh
Executable file
|
|
@@ -0,0 +1,23 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# scripts/deploy/deploy-control-plane.sh
|
||||||
|
set -e
|
||||||
|
|
||||||
|
VPS_IP="100.95.58.48"
|
||||||
|
USER="oskar"
|
||||||
|
REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
|
||||||
|
|
||||||
|
MODE=$1
|
||||||
|
|
||||||
|
case "$MODE" in
|
||||||
|
"--ssh")
|
||||||
|
echo "Deploying to VPS ($VPS_IP) via SSH..."
|
||||||
|
ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
|
||||||
|
;;
|
||||||
|
"--print")
|
||||||
|
echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "Usage: $0 [--ssh|--print]"
|
||||||
|
exit 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
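Note on deploy-control-plane.sh: the `--ssh` and `--print` branches each spell out the same remote command, so the two can drift apart when one is edited. One possible refactor, sketched here as an assumption rather than as the author's design, builds the command string in a single place. `build_remote_cmd` is a hypothetical name.

```shell
# Hypothetical helper (not in deploy-control-plane.sh): emit the remote command once.
build_remote_cmd() {
    # $1 = repository path on the remote host
    printf 'cd %s && git pull origin master && cd services/control-plane && bash deploy-local.sh' "$1"
}
```

The `--ssh` branch could then run `ssh -t "$USER@$VPS_IP" "$(build_remote_cmd "$REMOTE_REPO_PATH")"` and the `--print` branch could echo the same string, so both modes always agree.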
26
scripts/deploy/deploy-frigate.sh
Executable file
|
|
@@ -0,0 +1,26 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
|
||||||
|
|
||||||
|
MODE="print"
|
||||||
|
[[ "$1" == "--ssh" ]] && MODE="ssh"
|
||||||
|
|
||||||
|
TARGET="100.122.201.22"
|
||||||
|
NODE="chelsty-infra"
|
||||||
|
REPO_PATH="/home/oskar/homelab-codex-ws"
|
||||||
|
SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
|
||||||
|
|
||||||
|
echo "HOST: $NODE"
|
||||||
|
echo "MODE: $MODE"
|
||||||
|
echo "TARGET: $TARGET"
|
||||||
|
|
||||||
|
# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
|
||||||
|
# before first deploy. See config.yml for required variables.
|
||||||
|
DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker compose pull && docker compose up -d"
|
||||||
|
|
||||||
|
if [[ "$MODE" == "ssh" ]]; then
|
||||||
|
echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
|
||||||
|
ssh oskar@$TARGET "$DEPLOY_CMD"
|
||||||
|
else
|
||||||
|
echo "# --- Deployment commands for $NODE ---"
|
||||||
|
echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
|
||||||
|
fi
|
||||||
|
|
@@ -8,6 +8,7 @@ set -e
|
||||||
REPO_PATH="${HOME}/homelab-codex-ws"
|
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||||
RUNTIME_PATH="/opt/homelab"
|
RUNTIME_PATH="/opt/homelab"
|
||||||
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
|
HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
|
||||||
|
HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"
|
||||||
|
|
||||||
echo "--- Starting Deployment on ${HOSTNAME} ---"
|
echo "--- Starting Deployment on ${HOSTNAME} ---"
|
||||||
|
|
||||||
|
|
@@ -22,20 +23,33 @@ echo "Pulling latest changes..."
|
||||||
git pull
|
git pull
|
||||||
|
|
||||||
# 2. Identify Services
|
# 2. Identify Services
|
||||||
# Based on our convention, we look for services assigned to this host
|
SERVICES=()
|
||||||
# For now, we'll check if a 'services.txt' exists in the host folder
|
if [ -f "${HOST_DIR}/services.txt" ]; then
|
||||||
SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
|
mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
|
||||||
|
elif [ -f "${HOST_DIR}/services.yaml" ]; then
|
||||||
|
SERVICES=($(python3 -c "
|
||||||
|
import yaml, sys
|
||||||
|
try:
|
||||||
|
with open('${HOST_DIR}/services.yaml', 'r') as f:
|
||||||
|
data = yaml.safe_load(f)
|
||||||
|
if data and 'services' in data:
|
||||||
|
if isinstance(data['services'], dict):
|
||||||
|
print(' '.join(data['services'].keys()))
|
||||||
|
elif isinstance(data['services'], list):
|
||||||
|
print(' '.join(data['services']))
|
||||||
|
except Exception as e:
|
||||||
|
print(f'Error parsing YAML: {e}', file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
"))
|
||||||
|
fi
|
||||||
|
|
||||||
if [ ! -f "$SERVICE_LIST" ]; then
|
if [ ${#SERVICES[@]} -eq 0 ]; then
|
||||||
echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
|
echo "No services found for ${HOSTNAME}. Skipping service deployment."
|
||||||
exit 0
|
exit 0
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# 3. Deploy Services
|
# 3. Deploy Services
|
||||||
while IFS= read -r service || [ -n "$service" ]; do
|
for service in "${SERVICES[@]}"; do
|
||||||
[[ "$service" =~ ^#.*$ ]] && continue # Skip comments
|
|
||||||
[[ -z "$service" ]] && continue # Skip empty lines
|
|
||||||
|
|
||||||
echo "Deploying service: ${service}..."
|
echo "Deploying service: ${service}..."
|
||||||
|
|
||||||
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
|
COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
|
||||||
|
|
@@ -45,13 +59,10 @@ while IFS= read -r service || [ -n "$service" ]; do
|
||||||
continue
|
continue
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# Target directory in runtime
|
|
||||||
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
|
TARGET_DIR="${RUNTIME_PATH}/services/${service}"
|
||||||
mkdir -p "$TARGET_DIR"
|
mkdir -p "$TARGET_DIR"
|
||||||
|
|
||||||
# We use the compose file from the repo directly
|
OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
|
||||||
# but we can also handle overrides here
|
|
||||||
OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
|
|
||||||
|
|
||||||
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
|
COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
|
||||||
if [ -f "$OVERRIDE_FILE" ]; then
|
if [ -f "$OVERRIDE_FILE" ]; then
|
||||||
|
|
@@ -60,7 +71,6 @@ while IFS= read -r service || [ -n "$service" ]; do
|
||||||
fi
|
fi
|
||||||
|
|
||||||
$COMPOSE_CMD up -d --remove-orphans
|
$COMPOSE_CMD up -d --remove-orphans
|
||||||
|
done
|
||||||
done < "$SERVICE_LIST"
|
|
||||||
|
|
||||||
echo "--- Deployment Complete ---"
|
echo "--- Deployment Complete ---"
|
||||||
|
|
|
||||||
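Note on the services.txt branch: the new code filters comment and blank lines with `grep -v '^\s*#'` and `grep -v '^\s*$'`. `\s` is a GNU grep extension, so a portable variant would use the POSIX class `[[:space:]]` instead. The sketch below folds both filters into one call. `list_services` is a hypothetical helper name, not something defined in this diff.

```shell
# Hypothetical helper: print the service names from a services.txt-style file,
# dropping comment lines and blank (or whitespace-only) lines.
list_services() {
    # $1 = path to the list file; each -e pattern names a kind of line to exclude
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}
```

In the script this could feed the array with `mapfile -t SERVICES < <(list_services "${HOST_DIR}/services.txt")`. Note that grep exits non-zero when no line survives the filter, which is harmless inside a process substitution but would matter under `set -e` if the function were called directly.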
55
scripts/deploy/deploy-stability-agent.sh
Executable file
|
|
@@ -0,0 +1,55 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
|
||||||
|
|
||||||
|
NODE=$1
|
||||||
|
MODE="print"
|
||||||
|
[[ "$2" == "--ssh" ]] && MODE="ssh"
|
||||||
|
|
||||||
|
if [[ -z "$NODE" ]]; then
|
||||||
|
echo "Usage: $0 <node-name> [--ssh]"
|
||||||
|
echo "Supported nodes: chelsty, piha, solaria, vps"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
case "$NODE" in
|
||||||
|
piha) TARGET="100.108.208.3" ;;
|
||||||
|
chelsty) TARGET="100.122.201.22" ;;
|
||||||
|
vps) TARGET="100.95.58.48" ;;
|
||||||
|
solaria) TARGET="local" ;;
|
||||||
|
*)
|
||||||
|
echo "Error: Unknown node '$NODE'"
|
||||||
|
echo "Supported nodes: chelsty, piha, solaria, vps"
|
||||||
|
exit 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
echo "HOST: $NODE"
|
||||||
|
echo "MODE: $MODE"
|
||||||
|
echo "TARGET: $TARGET"
|
||||||
|
|
||||||
|
REPO_PATH="/home/oskar/homelab-codex-ws"
|
||||||
|
|
||||||
|
if [[ "$NODE" == "solaria" ]]; then
|
||||||
|
if [[ "$MODE" == "ssh" ]]; then
|
||||||
|
echo "--- Running local deployment for solaria ---"
|
||||||
|
cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
|
||||||
|
else
|
||||||
|
echo "# --- Deployment commands for solaria ---"
|
||||||
|
echo "cd $REPO_PATH"
|
||||||
|
echo "git fetch origin"
|
||||||
|
echo "git checkout master"
|
||||||
|
echo "git pull origin master"
|
||||||
|
echo "cd services/stability-agent"
|
||||||
|
echo "./deploy-local.sh solaria"
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
# Remote nodes
|
||||||
|
SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
|
||||||
|
if [[ "$MODE" == "ssh" ]]; then
|
||||||
|
echo "--- Deploying to $NODE ($TARGET) via SSH ---"
|
||||||
|
eval "$SSH_CMD"
|
||||||
|
else
|
||||||
|
echo "# --- Deployment commands for $NODE ---"
|
||||||
|
echo "$SSH_CMD"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
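Note on deploy-stability-agent.sh: the node-to-address mapping lives in a `case` statement, and the remote command is stored as a single string that is later run through `eval`. A possible alternative, sketched here under the assumption that the same addresses stay valid, keeps the lookup in a function and passes the remote command to `ssh` as an ordinary argument so no `eval` is needed. `resolve_target` is a hypothetical name.

```shell
# Hypothetical helper: map a node name to the address used in this script.
# Returns non-zero for a node that is not in the list.
resolve_target() {
    case "$1" in
        piha)    echo "100.108.208.3" ;;
        chelsty) echo "100.122.201.22" ;;
        vps)     echo "100.95.58.48" ;;
        solaria) echo "local" ;;
        *)       return 1 ;;
    esac
}
```

A caller could then write `TARGET=$(resolve_target "$NODE") || { echo "Error: Unknown node '$NODE'"; exit 1; }` followed by `ssh "oskar@$TARGET" "$REMOTE_CMD"`, where `REMOTE_CMD` holds only the part that runs on the remote side.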
|
|
@@ -1,268 +1,321 @@
|
||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
# deploy.sh - Staged deployment framework for homelab nodes.
|
# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
|
||||||
|
# Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
||||||
|
# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
|
||||||
|
# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
||||||
|
|
||||||
set -o pipefail
|
set -uo pipefail
|
||||||
|
|
||||||
# --- Configuration ---
|
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
|
||||||
export RUNTIME_PATH="/opt/homelab"
|
SSH_USER="${SSH_USER:-oskar}"
|
||||||
export STATE_DIR="${RUNTIME_PATH}/state/deploy"
|
START_TIME=$(date +%s)
|
||||||
export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
|
TARGET=""
|
||||||
export REPO_PATH="${HOME}/homelab-codex-ws"
|
DRY_RUN=false
|
||||||
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
|
NO_GATE=false
|
||||||
export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
|
|
||||||
|
|
||||||
# --- Initialization ---
|
usage() {
|
||||||
mkdir -p "$STATE_DIR" "$LOG_DIR"
|
cat >&2 <<'EOF'
|
||||||
|
Usage: deploy.sh <target> [--dry-run] [--no-gate]
|
||||||
|
|
||||||
# Redirection for logging
|
Targets:
|
||||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
control-plane observer/supervisor/executor/operator-ui on VPS
|
||||||
|
vps all VPS GitOps services
|
||||||
|
piha PIHA services
|
||||||
|
solaria SOLARIA compute services
|
||||||
|
chelsty-infra CHELSTY edge node (LTE, longer SSH timeout)
|
||||||
|
|
||||||
# --- Load Libraries ---
|
Flags:
|
||||||
LIB_PATH="${REPO_PATH}/scripts/lib"
|
--dry-run run preflight + gate only; stop before deploy
|
||||||
source "${LIB_PATH}/log.sh"
|
--no-gate skip pytest + docker build (emergency only; logged as WARNING)
|
||||||
source "${LIB_PATH}/state.sh"
|
|
||||||
source "${LIB_PATH}/inventory.sh"
|
|
||||||
source "${LIB_PATH}/compose.sh"
|
|
||||||
source "${LIB_PATH}/diagnostics.sh"
|
|
||||||
|
|
||||||
# --- CLI Parsing ---
|
Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
|
||||||
TARGET_HOST=$(hostname)
|
EOF
|
||||||
TARGET_SERVICE=""
|
exit 1
|
||||||
RESUME=false
|
}
|
||||||
REQUESTED_STAGE=""
|
|
||||||
|
|
||||||
while [[ $# -gt 0 ]]; do
|
while [[ $# -gt 0 ]]; do
|
||||||
case $1 in
|
case $1 in
|
||||||
--host)
|
control-plane|vps|piha|solaria|chelsty-infra)
|
||||||
TARGET_HOST="$2"
|
TARGET="$1"; shift ;;
|
||||||
shift 2
|
--dry-run)
|
||||||
;;
|
DRY_RUN=true; shift ;;
|
||||||
--service)
|
--no-gate)
|
||||||
TARGET_SERVICE="$2"
|
NO_GATE=true; shift ;;
|
||||||
shift 2
|
-h|--help)
|
||||||
;;
|
usage ;;
|
||||||
--resume)
|
|
||||||
RESUME=true
|
|
||||||
shift
|
|
||||||
;;
|
|
||||||
--stage)
|
|
||||||
REQUESTED_STAGE="$2"
|
|
||||||
shift 2
|
|
||||||
;;
|
|
||||||
*)
|
*)
|
||||||
if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
|
echo "Unknown argument: $1" >&2
|
||||||
REQUESTED_STAGE="$1"
|
usage ;;
|
||||||
fi
|
|
||||||
shift
|
|
||||||
;;
|
|
||||||
esac
|
esac
|
||||||
done
|
done
|
||||||
|
|
||||||
# --- Stages ---
|
[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
|
||||||
|
|
||||||
stage_prepare() {
|
case "$TARGET" in
|
||||||
local host=$1
|
control-plane) SSH_HOST="vps" ;;
|
||||||
if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
|
*) SSH_HOST="$TARGET" ;;
|
||||||
log "INFO" "Skipping PREPARE (already complete)"
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
log "INFO" "Stage: PREPARE ($host)"
|
|
||||||
set_stage "prepare"
|
|
||||||
|
|
||||||
cd "$REPO_PATH" || exit 1
|
|
||||||
log "INFO" "Pulling latest changes..."
|
|
||||||
if ! git pull; then
|
|
||||||
log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Ensure runtime directories exist
|
|
||||||
mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
|
|
||||||
|
|
||||||
struct_log "prepare" "$host" "all" "success" "repo_updated"
|
|
||||||
mark_stage_complete "prepare"
|
|
||||||
}
|
|
||||||
|
|
||||||
stage_validate() {
|
|
||||||
local host=$1
|
|
||||||
if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
|
|
||||||
log "INFO" "Skipping VALIDATE (already complete)"
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
log "INFO" "Stage: VALIDATE ($host)"
|
|
||||||
set_stage "validate"
|
|
||||||
|
|
||||||
for service in "${SERVICES[@]}"; do
|
|
||||||
log "INFO" "Validating $service..."
|
|
||||||
if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
|
|
||||||
log "ERROR" "Service definition not found: $service"
|
|
||||||
struct_log "validate" "$host" "$service" "fail" "not_found"
|
|
||||||
return 1
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
|
|
||||||
struct_log "validate" "$host" "all" "success" "validated"
|
|
||||||
mark_stage_complete "validate"
|
|
||||||
}
|
|
||||||
|
|
||||||
stage_deploy() {
|
|
||||||
local host=$1
|
|
||||||
if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
|
|
||||||
log "INFO" "Skipping DEPLOY (already complete)"
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
log "INFO" "Stage: DEPLOY ($host)"
|
|
||||||
set_stage "deploy"
|
|
||||||
|
|
||||||
local last_s=$(get_last_service)
|
|
||||||
local skip=false
|
|
||||||
if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
|
|
||||||
skip=true
|
|
||||||
fi
|
|
||||||
|
|
||||||
for service in "${SERVICES[@]}"; do
|
|
||||||
if [[ "$skip" == "true" ]]; then
|
|
||||||
if [[ "$service" == "$last_s" ]]; then
|
|
||||||
skip=false
|
|
||||||
log "INFO" "Resuming from $service..."
|
|
||||||
else
|
|
||||||
log "INFO" "Skipping $service (already processed)"
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
log "INFO" "Deploying $service..."
|
|
||||||
set_last_service "$service"
|
|
||||||
|
|
||||||
if ! run_compose_up "$service"; then
|
|
||||||
struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
|
|
||||||
collect_diagnostics "$host" "$service"
|
|
||||||
return 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
struct_log "deploy" "$host" "$service" "success" "deployed"
|
|
||||||
done
|
|
||||||
|
|
||||||
set_last_service ""
|
|
||||||
mark_stage_complete "deploy"
|
|
||||||
}
|
|
||||||
|
|
||||||
stage_verify() {
|
|
||||||
local host=$1
|
|
||||||
if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
|
|
||||||
log "INFO" "Skipping VERIFY (already complete)"
|
|
||||||
return 0
|
|
||||||
fi
|
|
||||||
|
|
||||||
log "INFO" "Stage: VERIFY ($host)"
|
|
||||||
set_stage "verify"
|
|
||||||
|
|
||||||
for service in "${SERVICES[@]}"; do
|
|
||||||
log "INFO" "Verifying $service..."
|
|
||||||
local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
|
|
||||||
if [[ -f "$health_script" ]]; then
|
|
||||||
if ! bash "$health_script"; then
|
|
||||||
log "ERROR" "Healthcheck failed for $service"
|
|
||||||
struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
|
|
||||||
collect_diagnostics "$host" "$service"
|
|
||||||
return 1
|
|
||||||
fi
|
|
||||||
else
|
|
||||||
# Generic check if container is running
|
|
||||||
if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
|
|
||||||
log "ERROR" "Container $service is not running"
|
|
||||||
struct_log "verify" "$host" "$service" "fail" "container_not_running"
|
|
||||||
collect_diagnostics "$host" "$service"
|
|
||||||
return 1
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
struct_log "verify" "$host" "$service" "success" "verified"
|
|
||||||
done
|
|
||||||
mark_stage_complete "verify"
|
|
||||||
}
|
|
||||||
|
|
||||||
stage_complete() {
|
|
||||||
local host=$1
|
|
||||||
log "INFO" "Stage: COMPLETE ($host)"
|
|
||||||
set_stage "complete"
|
|
||||||
struct_log "complete" "$host" "all" "success" "deployment_finished"
|
|
||||||
clear_deployment_state
|
|
||||||
}
|
|
||||||
|
|
||||||
# --- Execution Logic ---
|
|
||||||
|
|
||||||
run_deployment() {
|
|
||||||
local start_stage=$1
|
|
||||||
|
|
||||||
# Sequential execution from start_stage
|
|
||||||
case "$start_stage" in
|
|
||||||
prepare)
|
|
||||||
stage_prepare "$TARGET_HOST" || return 1
|
|
||||||
;&
|
|
||||||
validate)
|
|
||||||
stage_validate "$TARGET_HOST" || return 1
|
|
||||||
;&
|
|
||||||
deploy)
|
|
||||||
stage_deploy "$TARGET_HOST" || return 1
|
|
||||||
;&
|
|
||||||
verify)
|
|
||||||
stage_verify "$TARGET_HOST" || return 1
|
|
||||||
;&
|
|
||||||
complete)
|
|
||||||
stage_complete "$TARGET_HOST" || return 1
|
|
||||||
;;
|
|
||||||
*)
|
|
||||||
log "ERROR" "Invalid stage: $start_stage"
|
|
||||||
return 1
|
|
||||||
;;
|
|
||||||
esac
|
esac
|
||||||
}
|
|
||||||
|
|
||||||
# --- Main ---
|
case "$TARGET" in
|
||||||
|
chelsty-*) SSH_TIMEOUT=30 ;;
|
||||||
|
*) SSH_TIMEOUT=5 ;;
|
||||||
|
esac
|
||||||
|
|
||||||
log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
|
# ── PREFLIGHT ────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
|
preflight() {
|
||||||
log "ERROR" "Failed to load inventory"
|
echo "=== PREFLIGHT ==="
|
||||||
|
|
||||||
|
local branch
|
||||||
|
branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
|
||||||
|
if [[ "$branch" != "master" ]]; then
|
||||||
|
echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
echo "[ok] branch: master"
|
||||||
|
|
||||||
EXIT_STATUS=0
|
if ! git -C "$REPO_ROOT" diff --quiet; then
|
||||||
if [[ "$RESUME" == "true" ]]; then
|
echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
|
||||||
CURRENT=$(get_stage)
|
|
||||||
log "INFO" "Resuming from state: $CURRENT"
|
|
||||||
case "$CURRENT" in
|
|
||||||
prepare|validate|deploy|verify)
|
|
||||||
run_deployment "$CURRENT" || EXIT_STATUS=1
|
|
||||||
;;
|
|
||||||
complete|none)
|
|
||||||
log "INFO" "No interrupted deployment found. Starting from scratch..."
|
|
||||||
run_deployment "prepare" || EXIT_STATUS=1
|
|
||||||
;;
|
|
||||||
*)
|
|
||||||
log "INFO" "Unknown state. Starting from prepare..."
|
|
||||||
run_deployment "prepare" || EXIT_STATUS=1
|
|
||||||
;;
|
|
||||||
esac
|
|
||||||
elif [[ -n "$REQUESTED_STAGE" ]]; then
|
|
||||||
if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
|
|
||||||
collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
|
|
||||||
else
|
|
||||||
run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
|
|
||||||
fi
|
|
||||||
else
|
|
||||||
# New deployment - clear previous state
|
|
||||||
clear_deployment_state
|
|
||||||
run_deployment "prepare" || EXIT_STATUS=1
|
|
||||||
fi
|
|
||||||
|
|
||||||
if [[ $EXIT_STATUS -eq 0 ]]; then
|
|
||||||
print_summary "$TARGET_HOST" "SUCCESS"
|
|
||||||
log "INFO" "--- Homelab Deployment Finished Successfully ---"
|
|
||||||
else
|
|
||||||
print_summary "$TARGET_HOST" "FAILED"
|
|
||||||
log "ERROR" "--- Homelab Deployment Failed ---"
|
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
if ! git -C "$REPO_ROOT" diff --cached --quiet; then
|
||||||
|
echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "[ok] working tree clean"
|
||||||
|
|
||||||
|
git -C "$REPO_ROOT" fetch origin master --quiet
|
||||||
|
local unpushed
|
||||||
|
unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
|
||||||
|
if [[ -n "$unpushed" ]]; then
|
||||||
|
echo "ERROR: Unpushed commits on master:" >&2
|
||||||
|
echo "$unpushed" >&2
|
||||||
|
echo "Push first: git push origin master" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "[ok] no unpushed commits"
|
||||||
|
|
||||||
|
echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
|
||||||
|
if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||||
|
"${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
|
||||||
|
echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
echo "[ok] ${SSH_HOST} reachable"
|
||||||
|
}
|
||||||
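Note on the SSH timeout used by preflight: the timeout is chosen by a glob on the target name, with `chelsty-*` getting 30 seconds and everything else 5. The same selection is sketched below as a function, mainly to make the pattern's reach explicit. `ssh_timeout_for` is a hypothetical name and is not defined in deploy.sh.

```shell
# Hypothetical helper: pick the SSH ConnectTimeout (seconds) for a deploy target.
ssh_timeout_for() {
    case "$1" in
        chelsty-*) echo 30 ;;  # LTE edge node, slower to answer
        *)         echo 5 ;;
    esac
}
```

The glob requires the hyphen, so `chelsty-infra` gets 30 seconds while a bare `chelsty` (the spelling deploy-stability-agent.sh uses) would fall through to 5. That is harmless today because deploy.sh only accepts `chelsty-infra`, but worth keeping in mind if the target list grows.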
|
|
||||||
|
# ── GATE ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
gate() {
|
||||||
|
if [[ "$NO_GATE" == "true" ]]; then
|
||||||
|
echo "=== GATE: SKIPPED ==="
|
||||||
|
echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "=== GATE ==="
|
||||||
|
|
||||||
|
local services=()
|
||||||
|
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
services=("control-plane")
|
||||||
|
else
|
||||||
|
local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
|
||||||
|
if [[ ! -f "$svc_yaml" ]]; then
|
||||||
|
echo "ERROR: ${svc_yaml} not found." >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
local svc_list
|
||||||
|
svc_list=$(python3 -c "
|
||||||
|
import yaml
|
||||||
|
with open('${svc_yaml}') as f:
|
||||||
|
data = yaml.safe_load(f)
|
||||||
|
svcs = (data or {}).get('services') or {}  # tolerate an empty services.yaml
|
||||||
|
if isinstance(svcs, dict):
|
||||||
|
print('\n'.join(svcs.keys()))
|
||||||
|
elif isinstance(svcs, list):
|
||||||
|
print('\n'.join(svcs))
|
||||||
|
")
|
||||||
|
while IFS= read -r svc; do
|
||||||
|
[[ -z "$svc" ]] && continue
|
||||||
|
if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
|
||||||
|
services+=("$svc")
|
||||||
|
fi
|
||||||
|
done <<< "$svc_list"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ ${#services[@]} -eq 0 ]]; then
|
||||||
|
echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Services under gate: ${services[*]}"
|
||||||
|
local gate_failed=false
|
||||||
|
|
||||||
|
for svc in "${services[@]}"; do
|
||||||
|
local svc_dir="${REPO_ROOT}/services/${svc}"
|
||||||
|
|
||||||
|
if [[ -d "${svc_dir}/tests" ]]; then
|
||||||
|
echo "--- pytest: ${svc} ---"
|
||||||
|
if ! python3 -m pytest "${svc_dir}/tests" -q; then
|
||||||
|
echo "GATE FAIL: pytest failed for ${svc}" >&2
|
||||||
|
gate_failed=true
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "--- docker build: ${svc} ---"
|
||||||
|
if ! docker build --quiet "${svc_dir}" >/dev/null; then
|
||||||
|
echo "GATE FAIL: docker build failed for ${svc}" >&2
|
||||||
|
gate_failed=true
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$gate_failed" == "true" ]]; then
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
echo "[ok] gate passed"
|
||||||
|
}
|
||||||
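Note on gate(): the loop records a failure in `gate_failed` and keeps going, so one run reports every failing service instead of stopping at the first. The same "run everything, fail at the end" pattern is sketched below in isolation. `run_all_checks` and the check functions passed to it are hypothetical; the variables are global because the sketch sticks to POSIX shell.

```shell
# Hypothetical helper: run every named check, report each failure,
# and return non-zero if at least one check failed.
run_all_checks() {
    failed=0
    for check in "$@"; do
        if ! "$check"; then
            echo "GATE FAIL: $check" >&2
            failed=1   # remember the failure but keep running the rest
        fi
    done
    return "$failed"
}
```

With per-service functions wrapping pytest and `docker build`, the gate could reduce to `run_all_checks check_pytest check_build || exit 2`, keeping exit code 2 for gate failures as documented in the usage text.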
|
|
||||||
|
# ── EXECUTE ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
execute() {
|
||||||
|
echo "=== EXECUTE ==="
|
||||||
|
|
||||||
|
local cmd_output
|
||||||
|
local cmd_exit=0
|
||||||
|
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
echo "Running deploy-control-plane.sh --ssh..."
|
||||||
|
cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
|
||||||
|
|| cmd_exit=$?
|
||||||
|
else
|
||||||
|
echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
|
||||||
|
cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||||
|
"${SSH_USER}@${SSH_HOST}" \
|
||||||
|
'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
|
||||||
|
|| cmd_exit=$?
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "$cmd_output"
|
||||||
|
|
||||||
|
if echo "$cmd_output" | grep -qE '\[sudo\] password|sudo: a (terminal|password) is required'; then
|
||||||
|
echo "" >&2
|
||||||
|
echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
|
||||||
|
echo "Run manually:" >&2
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
|
||||||
|
else
|
||||||
|
echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
|
||||||
|
fi
|
||||||
|
exit 5
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $cmd_exit -ne 0 ]]; then
|
||||||
|
echo "ERROR: Deploy command exited ${cmd_exit}." >&2
|
||||||
|
exit 3
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "[ok] execute completed"
|
||||||
|
}
|
||||||
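Note on exit codes: deploy.sh documents 0=ok, 1=preflight, 2=gate, 3=execute, 4=verify and 5=handoff(sudo). A wrapper or CI job that calls the script may want to turn the number back into the stage name. The sketch below does only that. `describe_exit` is a hypothetical helper and the "unknown" fallback is an assumption.

```shell
# Hypothetical helper: translate a deploy.sh exit code into the stage it refers to.
describe_exit() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "preflight" ;;
        2) echo "gate" ;;
        3) echo "execute" ;;
        4) echo "verify" ;;
        5) echo "handoff(sudo)" ;;
        *) echo "unknown" ;;
    esac
}
```

A caller could capture the status with `./scripts/deploy/deploy.sh piha; rc=$?` and then print `describe_exit "$rc"` to show which stage stopped the deploy.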
|
|
||||||
|
# ── VERIFY ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
verify() {
|
||||||
|
echo "=== VERIFY ==="
|
||||||
|
|
||||||
|
local ps_output
|
||||||
|
local ps_exit=0
|
||||||
|
ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
|
||||||
|
"${SSH_USER}@${SSH_HOST}" \
|
||||||
|
'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
|
||||||
|
|| ps_exit=$?
|
||||||
|
|
||||||
|
if [[ $ps_exit -ne 0 ]]; then
|
||||||
|
echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
|
||||||
|
echo "$ps_output" >&2
|
||||||
|
exit 4
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "$ps_output"
|
||||||
|
|
||||||
|
local failed=false
|
||||||
|
|
||||||
|
local not_up
|
||||||
|
not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
|
||||||
|
if [[ -n "$not_up" ]]; then
|
||||||
|
echo "ERROR: Containers not in Up state:" >&2
|
||||||
|
echo "$not_up" >&2
|
||||||
|
failed=true
|
||||||
|
fi
|
||||||
|
|
||||||
|
local unhealthy
|
||||||
|
unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
|
||||||
|
if [[ -n "$unhealthy" ]]; then
|
||||||
|
echo "ERROR: Unhealthy containers:" >&2
|
||||||
|
echo "$unhealthy" >&2
|
||||||
|
failed=true
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$TARGET" == "control-plane" ]]; then
|
||||||
|
for cp_svc in supervisor observer executor operator-ui; do
|
||||||
|
if ! echo "$ps_output" | grep -q "$cp_svc"; then
|
||||||
|
echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
|
||||||
|
failed=true
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$failed" == "true" ]]; then
|
||||||
|
echo "" >&2
|
||||||
|
echo "Full docker ps output above." >&2
|
||||||
|
exit 4
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "[ok] all containers healthy"
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── REPORT ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
report() {
|
||||||
|
local mode="${1:-deploy}"
|
||||||
|
local end_time
|
||||||
|
end_time=$(date +%s)
|
||||||
|
local elapsed
|
||||||
|
elapsed=$(( end_time - START_TIME ))
|
||||||
|
local commit_hash
|
||||||
|
commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
|
||||||
|
local gate_s verify_s
|
||||||
|
|
||||||
|
if [[ "$NO_GATE" == "true" ]]; then
|
||||||
|
gate_s="skip"
|
||||||
|
else
|
||||||
|
gate_s="ok"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$mode" == "dry-run" ]]; then
|
||||||
|
verify_s="skip(dry-run)"
|
||||||
|
else
|
||||||
|
verify_s="green"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
if [[ "$mode" == "dry-run" ]]; then
|
||||||
|
echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
||||||
|
else
|
||||||
|
echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── MAIN ─────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
preflight
|
||||||
|
gate
|
||||||
|
|
||||||
|
if [[ "$DRY_RUN" == "true" ]]; then
|
||||||
|
report dry-run
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
execute
|
||||||
|
verify
|
||||||
|
report
|
||||||
|
|
|
||||||
|
|
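Review note on verify() above: it decides pass/fail by grepping the `docker ps` text twice, once for rows that are not `Up` and once for `(unhealthy)`. A minimal sketch of the same check done in a single pass, assuming the same tab-separated `{{.Names}}\t{{.Status}}` format; `check_ps_output` is a hypothetical helper name, not something defined in this diff.

```shell
#!/usr/bin/env bash
# Sketch only: classify `docker ps --format "{{.Names}}\t{{.Status}}"` output.
# check_ps_output is a hypothetical name used for illustration.
check_ps_output() {
  local ps_output="$1" failed=0 name status
  while IFS=$'\t' read -r name status; do
    # Skip blank lines (for example, when no containers are running).
    [[ -z "$name" ]] && continue
    if [[ "$status" != Up* ]]; then
      echo "not up: ${name} (${status})" >&2
      failed=1
    elif [[ "$status" == *"(unhealthy)"* ]]; then
      echo "unhealthy: ${name}" >&2
      failed=1
    fi
  done <<< "$ps_output"
  # 0 when every listed container is Up and not unhealthy, 1 otherwise.
  return "$failed"
}
```

A caller could use it as `check_ps_output "$ps_output" || exit 4`, which keeps the exit code that verify() already uses for this failure.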
@ -1,15 +1,30 @@
|
||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
# orchestrate-deploy.sh - To be run on SATURN
|
# orchestrate-deploy.sh - To be run on SATURN
|
||||||
# Triggers deployment on remote execution nodes.
|
# Triggers deployment on remote execution nodes via inventory.
|
||||||
|
|
||||||
set -e
|
set -e
|
||||||
|
|
||||||
HOSTS=("solaria" "piha" "vps")
|
REPO_PATH="${HOME}/homelab-codex-ws"
|
||||||
USER="oskar" # Default user
|
USER="oskar"
|
||||||
|
|
||||||
for HOST in "${HOSTS[@]}"; do
|
while IFS=' ' read -r HOST TAG; do
|
||||||
echo ">>> Triggering deployment on ${HOST}..."
|
echo ">>> Triggering deployment on ${HOST}..."
|
||||||
|
if [[ "$TAG" == "lte" ]]; then
|
||||||
|
ssh -n -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
|
||||||
|
echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
|
||||||
|
else
|
||||||
ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
|
ssh -n "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"  # -n: keep ssh from consuming the host list on stdin
|
||||||
done
|
fi
|
||||||
|
done < <(python3 -c "
|
||||||
|
import yaml, sys
|
||||||
|
with open('${REPO_PATH}/inventory/topology.yaml') as f:
|
||||||
|
data = yaml.safe_load(f)
|
||||||
|
skip = {'saturn', 'solaria'}
|
||||||
|
for name, info in (data.get('nodes') or {}).items():
|
||||||
|
if name in skip:
|
||||||
|
continue
|
||||||
|
uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
|
||||||
|
print(name, 'lte' if uplink == 'lte' else 'standard')
|
||||||
|
")
|
||||||
|
|
||||||
echo ">>> All deployments triggered."
|
echo ">>> All deployments triggered."
|
||||||
|
|
|
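Review note on the new loop in orchestrate-deploy.sh: it reads `host tag` pairs from a process substitution and tolerates failures only for hosts tagged `lte`. Any command inside such a loop that reads stdin (ssh does by default) can swallow the remaining host lines. A minimal sketch of the pattern with stdin detached from the per-host command follows; `deploy_fleet` and its runner argument are hypothetical names. Unlike the script, which aborts on the first non-LTE failure because of `set -e`, this sketch records the failure and continues.

```shell
#!/usr/bin/env bash
# Sketch only: run one command per "host tag" line read from stdin.
# deploy_fleet and the runner argument are hypothetical, for illustration.
deploy_fleet() {
  local runner="$1" host tag rc=0
  while IFS=' ' read -r host tag; do
    [[ -z "$host" ]] && continue
    if [[ "$tag" == "lte" ]]; then
      # Intermittent nodes: warn and move on.
      "$runner" "$host" < /dev/null \
        || echo "WARNING: ${host} failed or timed out (LTE node, skipping)" >&2
    else
      # Other nodes: remember the failure but keep going (assumption, see note above).
      "$runner" "$host" < /dev/null || rc=1
    fi
  done
  return "$rc"
}
```

For example, `deploy_fleet my_runner < <(printf 'piha standard\nchelsty lte\n')`, where `my_runner` is whatever function wraps the ssh call.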
||||||
68
scripts/deploy/verify-agent-fleet.sh
Executable file
|
|
@ -0,0 +1,68 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
|
||||||
|
|
||||||
|
REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
|
||||||
|
|
||||||
|
# Check if docker is available
|
||||||
|
if ! command -v docker &> /dev/null; then
|
||||||
|
echo "Error: docker command not found."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check if container is running
|
||||||
|
if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
|
||||||
|
echo "Error: agent-system-redis container not found or not running."
|
||||||
|
echo "This script must be run on PIHA (the node hosting the Redis container)."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
|
||||||
|
MISSING_NODES=0
|
||||||
|
|
||||||
|
echo "--- Homelab Agent Fleet Status ---"
|
||||||
|
printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
|
||||||
|
printf "%s\n" "--------------------------------------------------------------------------------"
|
||||||
|
|
||||||
|
for NODE in "${REQUIRED_NODES[@]}"; do
|
||||||
|
KEY="homelab:nodes:$NODE"
|
||||||
|
|
||||||
|
# Check if key exists
|
||||||
|
EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
|
||||||
|
|
||||||
|
if [[ "$EXISTS" != "1" ]]; then
|
||||||
|
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
|
||||||
|
MISSING_NODES=$((MISSING_NODES + 1))
|
||||||
|
continue
|
||||||
|
fi
|
||||||
|
|
||||||
|
HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
|
||||||
|
HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
|
||||||
|
STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
|
||||||
|
LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
|
||||||
|
|
||||||
|
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
|
||||||
|
done
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "--- Control Plane Summary ---"
|
||||||
|
if command -v jq >/dev/null; then
|
||||||
|
curl -s http://127.0.0.1:18180/summary | jq .
|
||||||
|
else
|
||||||
|
curl -s http://127.0.0.1:18180/summary
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "--- Control Plane Nodes ---"
|
||||||
|
if command -v jq >/dev/null; then
|
||||||
|
curl -s http://127.0.0.1:18180/nodes | jq .
|
||||||
|
else
|
||||||
|
curl -s http://127.0.0.1:18180/nodes
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ $MISSING_NODES -gt 0 ]]; then
|
||||||
|
echo ""
|
||||||
|
echo "Error: $MISSING_NODES required nodes are missing from Redis."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit 0
|
||||||
361
scripts/dev/agent.sh
Executable file
|
|
@ -0,0 +1,361 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# Multi-agent worktree manager.
|
||||||
|
# EXIT: 0 ok, 1 preflight, 2 operation failed.
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
|
||||||
|
|
||||||
|
RESERVED_NAMES=(master main HEAD list merge clean new)
|
||||||
|
MAX_WORKTREES=4
|
||||||
|
|
||||||
|
die() { echo "ERROR: $*" >&2; exit "${2:-2}"; }
|
||||||
|
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
|
||||||
|
|
||||||
|
# ── helpers ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
is_main_checkout() {
|
||||||
|
local git_dir common_dir
|
||||||
|
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
|
||||||
|
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
|
||||||
|
[ "$git_dir" = "$common_dir" ]
|
||||||
|
}
|
||||||
|
|
||||||
|
require_main_checkout() {
|
||||||
|
is_main_checkout || prefail "must run from the main checkout, not a worktree"
|
||||||
|
}
|
||||||
|
|
||||||
|
require_master_branch() {
|
||||||
|
local branch
|
||||||
|
branch=$(git rev-parse --abbrev-ref HEAD)
|
||||||
|
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
|
||||||
|
}
|
||||||
|
|
||||||
|
require_clean_tree() {
|
||||||
|
local dirty
|
||||||
|
dirty=$(git status --porcelain)
|
||||||
|
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
|
||||||
|
}
|
||||||
|
|
||||||
|
worktree_paths() {
|
||||||
|
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
|
||||||
|
local main_path
|
||||||
|
main_path=$(git rev-parse --show-toplevel)
|
||||||
|
git worktree list --porcelain \
|
||||||
|
| awk '/^worktree /{print substr($0, 10)}' \
|
||||||
|
| grep -vxF "$main_path" \
|
||||||
|
|| true
|
||||||
|
}
|
||||||
|
|
||||||
|
worktree_count() {
|
||||||
|
worktree_paths | wc -l
|
||||||
|
}
|
||||||
|
|
||||||
|
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
|
||||||
|
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
|
||||||
|
|
||||||
|
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
|
||||||
|
|
||||||
|
age_str() {
|
||||||
|
local created_utc="$1"
|
||||||
|
local now_ts created_ts diff_s
|
||||||
|
now_ts=$(date -u +%s)
|
||||||
|
# replace T with a space for GNU `date -d`; the trailing Z (UTC) is parsed as-is
|
||||||
|
created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
|
||||||
|
diff_s=$(( now_ts - created_ts ))
|
||||||
|
if (( diff_s < 60 )); then echo "${diff_s}s"
|
||||||
|
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
|
||||||
|
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
|
||||||
|
else echo "$(( diff_s/86400 ))d"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
validate_name() {
|
||||||
|
local name="$1"
|
||||||
|
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
|
||||||
|
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
|
||||||
|
fi
|
||||||
|
for r in "${RESERVED_NAMES[@]}"; do
|
||||||
|
if [ "$name" = "$r" ]; then
|
||||||
|
prefail "'$name' is a reserved word"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── subcommands ───────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
cmd_new() {
|
||||||
|
local name="${1:-}"
|
||||||
|
[ -n "$name" ] || { usage; exit 1; }
|
||||||
|
|
||||||
|
validate_name "$name"
|
||||||
|
require_main_checkout
|
||||||
|
require_master_branch
|
||||||
|
require_clean_tree
|
||||||
|
|
||||||
|
# worktree limit
|
||||||
|
local count
|
||||||
|
count=$(worktree_count)
|
||||||
|
if (( count >= MAX_WORKTREES )); then
|
||||||
|
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
|
||||||
|
cmd_list
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# branch collision
|
||||||
|
if branch_exists_local "task/$name"; then
|
||||||
|
prefail "branch task/$name already exists locally"
|
||||||
|
fi
|
||||||
|
git fetch origin master --quiet
|
||||||
|
if branch_exists_remote "refs/heads/task/$name"; then
|
||||||
|
prefail "branch task/$name already exists on origin"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# directory collision
|
||||||
|
local main_path wt_path
|
||||||
|
main_path=$(git rev-parse --show-toplevel)
|
||||||
|
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
||||||
|
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
|
||||||
|
|
||||||
|
# create worktree
|
||||||
|
git worktree add -b "task/$name" "$wt_path" origin/master \
|
||||||
|
|| die "git worktree add failed"
|
||||||
|
|
||||||
|
# write marker
|
||||||
|
local parent_commit
|
||||||
|
parent_commit=$(git rev-parse origin/master)
|
||||||
|
cat > "$wt_path/.agent-task" <<EOF
|
||||||
|
task: $name
|
||||||
|
branch: task/$name
|
||||||
|
parent_commit: $parent_commit
|
||||||
|
created_utc: $(utc_now)
|
||||||
|
worktree_path: $wt_path
|
||||||
|
EOF
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "Worktree created: $wt_path"
|
||||||
|
echo "Branch: task/$name"
|
||||||
|
echo ""
|
||||||
|
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
|
||||||
|
echo "cd ~/homelab-codex-ws-${name} && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
|
||||||
|
echo "─────────────────────────────────────────────────────────────────────────────"
|
||||||
|
}
|
||||||
|
|
||||||
|
cmd_list() {
|
||||||
|
local main_path
|
||||||
|
main_path=$(git rev-parse --show-toplevel)
|
||||||
|
|
||||||
|
# fetch to get up-to-date ahead/behind
|
||||||
|
git fetch origin master --quiet 2>/dev/null || true
|
||||||
|
|
||||||
|
local paths
|
||||||
|
paths=$(worktree_paths)
|
||||||
|
|
||||||
|
if [ -z "$paths" ]; then
|
||||||
|
echo "(no active task worktrees)"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
||||||
|
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
|
||||||
|
|
||||||
|
while IFS= read -r wt_path; do
|
||||||
|
[ -z "$wt_path" ] && continue
|
||||||
|
|
||||||
|
local marker="$wt_path/.agent-task"
|
||||||
|
local task_name branch parent_commit created_utc
|
||||||
|
if [ -f "$marker" ]; then
|
||||||
|
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
|
||||||
|
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
|
||||||
|
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
|
||||||
|
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
|
||||||
|
else
|
||||||
|
task_name="(no marker)"
|
||||||
|
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
|
||||||
|
parent_commit="?"
|
||||||
|
created_utc=""
|
||||||
|
fi
|
||||||
|
|
||||||
|
local status="clean"
|
||||||
|
local dirty
|
||||||
|
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
|
||||||
|
[ -n "$dirty" ] && status="dirty"
|
||||||
|
|
||||||
|
local ahead behind ab
|
||||||
|
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
|
||||||
|
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
|
||||||
|
ab="+${ahead}/-${behind}"
|
||||||
|
|
||||||
|
local age=""
|
||||||
|
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
|
||||||
|
|
||||||
|
local short_parent="${parent_commit:0:7}"
|
||||||
|
local short_created="${created_utc:0:10}"
|
||||||
|
|
||||||
|
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
|
||||||
|
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
|
||||||
|
done <<< "$paths"
|
||||||
|
}
|
||||||
|
|
||||||
|
cmd_merge() {
|
||||||
|
local name="${1:-}"
|
||||||
|
[ -n "$name" ] || { usage; exit 1; }
|
||||||
|
|
||||||
|
require_main_checkout
|
||||||
|
require_master_branch
|
||||||
|
require_clean_tree
|
||||||
|
|
||||||
|
git fetch origin --quiet
|
||||||
|
|
||||||
|
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
|
||||||
|
|
||||||
|
local main_path wt_path
|
||||||
|
main_path=$(git rev-parse --show-toplevel)
|
||||||
|
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
|
||||||
|
|
||||||
|
# attempt ff-only merge
|
||||||
|
local merge_failed=0
|
||||||
|
git merge --ff-only "task/$name" || merge_failed=1
|
||||||
|
|
||||||
|
if (( merge_failed )); then
|
||||||
|
# abort any partial merge state
|
||||||
|
git merge --abort 2>/dev/null || true
|
||||||
|
echo ""
|
||||||
|
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
|
||||||
|
echo " The branch has likely diverged from master." >&2
|
||||||
|
echo "" >&2
|
||||||
|
echo "Diagnose with:" >&2
|
||||||
|
echo " git log master..task/$name # commits only on task branch" >&2
|
||||||
|
echo " git log task/$name..master # commits master has that task doesn't" >&2
|
||||||
|
echo "" >&2
|
||||||
|
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
|
||||||
|
echo "Worktree and branch are preserved — no changes made." >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Merged task/$name into master (fast-forward)."
|
||||||
|
|
||||||
|
git push origin master || die "git push origin master failed"
|
||||||
|
echo "Pushed master to origin."
|
||||||
|
|
||||||
|
if [ -d "$wt_path" ]; then
|
||||||
|
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
|
||||||
|
echo "Removed worktree: $wt_path"
|
||||||
|
else
|
||||||
|
echo "(worktree directory $wt_path not found — skipping worktree remove)"
|
||||||
|
fi
|
||||||
|
|
||||||
|
git branch -d "task/$name" || die "git branch -d task/$name failed"
|
||||||
|
echo "Deleted local branch task/$name."
|
||||||
|
|
||||||
|
git push origin --delete "task/$name" 2>/dev/null \
|
||||||
|
&& echo "Deleted remote branch task/$name." \
|
||||||
|
|| echo "(remote branch task/$name not found — nothing to delete)"
|
||||||
|
|
||||||
|
echo ""
|
||||||
|
echo "Done. task/$name merged and cleaned up."
|
||||||
|
}
|
||||||
|
|
||||||
|
cmd_clean() {
|
||||||
|
local main_path
|
||||||
|
main_path=$(git rev-parse --show-toplevel)
|
||||||
|
git fetch origin --quiet 2>/dev/null || true
|
||||||
|
|
||||||
|
local to_remove=()
|
||||||
|
|
||||||
|
# orphaned registered worktrees: branch deleted or fully merged into master
|
||||||
|
local paths
|
||||||
|
paths=$(worktree_paths)
|
||||||
|
while IFS= read -r wt_path; do
|
||||||
|
[ -z "$wt_path" ] && continue
|
||||||
|
local branch
|
||||||
|
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
|
||||||
|
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
|
||||||
|
|
||||||
|
# branch gone locally?
|
||||||
|
if ! branch_exists_local "$branch"; then
|
||||||
|
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
|
||||||
|
continue
|
||||||
|
fi
|
||||||
|
|
||||||
|
# branch fully merged into master?
|
||||||
|
local ahead
|
||||||
|
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
|
||||||
|
if [ "$ahead" = "0" ]; then
|
||||||
|
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
|
||||||
|
fi
|
||||||
|
done <<< "$paths"
|
||||||
|
|
||||||
|
# dangling directories: ../homelab-codex-ws-* not registered
|
||||||
|
local registered_paths
|
||||||
|
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
|
||||||
|
local parent_dir
|
||||||
|
parent_dir=$(dirname "$main_path")
|
||||||
|
while IFS= read -r candidate; do
|
||||||
|
[ -d "$candidate" ] || continue
|
||||||
|
if ! echo "$registered_paths" | grep -qxF "$candidate"; then
|
||||||
|
to_remove+=("dangling:$candidate")
|
||||||
|
fi
|
||||||
|
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
|
||||||
|
|
||||||
|
if [ ${#to_remove[@]} -eq 0 ]; then
|
||||||
|
echo "Nothing to clean."
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Found ${#to_remove[@]} item(s) to clean:"
|
||||||
|
for entry in "${to_remove[@]}"; do
|
||||||
|
echo " $entry"
|
||||||
|
done
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
local overall_rc=0
|
||||||
|
for entry in "${to_remove[@]}"; do
|
||||||
|
local kind="${entry%%:*}"
|
||||||
|
local path="${entry#*:}"
|
||||||
|
# strip trailing annotation in parens
|
||||||
|
local raw_path
|
||||||
|
raw_path="${path%% (*}"
|
||||||
|
|
||||||
|
local confirm
|
||||||
|
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
|
||||||
|
if [[ "$confirm" =~ ^[Yy]$ ]]; then
|
||||||
|
if [ "$kind" = "worktree" ]; then
|
||||||
|
git worktree remove --force "$raw_path" 2>/dev/null \
|
||||||
|
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
|
||||||
|
else
|
||||||
|
rm -rf "$raw_path"
|
||||||
|
fi
|
||||||
|
echo " Removed."
|
||||||
|
else
|
||||||
|
echo " Skipped."
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
return $overall_rc
|
||||||
|
}
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
cat <<'EOF'
|
||||||
|
Usage: agent.sh <subcommand> [args]
|
||||||
|
|
||||||
|
agent.sh new <name> Create a new task worktree (branch task/<name>)
|
||||||
|
agent.sh list List active task worktrees with status
|
||||||
|
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
|
||||||
|
agent.sh clean Remove orphaned or dangling worktrees (interactive)
|
||||||
|
|
||||||
|
EXIT: 0 ok, 1 preflight, 2 operation failed.
|
||||||
|
EOF
|
||||||
|
}
|
||||||
|
|
||||||
|
# ── dispatch ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
SUBCOMMAND="${1:-}"
|
||||||
|
shift || true
|
||||||
|
|
||||||
|
case "$SUBCOMMAND" in
|
||||||
|
new) cmd_new "$@" ;;
|
||||||
|
list) cmd_list "$@" ;;
|
||||||
|
merge) cmd_merge "$@" ;;
|
||||||
|
clean) cmd_clean "$@" ;;
|
||||||
|
*) usage; exit 1 ;;
|
||||||
|
esac
|
||||||
|
|
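Review note on worktree_paths() in agent.sh: the helper extracts paths from `git worktree list --porcelain` and drops the main checkout. A minimal pure-bash sketch of that parsing follows, assuming the porcelain format in which each record starts with a `worktree <path>` line. It takes the whole remainder of the line as the path, so paths containing spaces survive, and it compares against the main path literally rather than as a regular expression. `linked_worktrees` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch only: print linked worktree paths from porcelain text on stdin,
# skipping the main checkout passed as $1. linked_worktrees is hypothetical.
linked_worktrees() {
  local main_path="$1" line path
  while IFS= read -r line; do
    [[ "$line" == "worktree "* ]] || continue
    path="${line#worktree }"
    # Quoted right-hand side: literal comparison, no pattern matching.
    [[ "$path" == "$main_path" ]] || printf '%s\n' "$path"
  done
  return 0
}
```

Usage would look like `git worktree list --porcelain | linked_worktrees "$(git rev-parse --show-toplevel)"`.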
@ -6,6 +6,10 @@ collect_diagnostics() {
|
||||||
local service=$2
|
local service=$2
|
||||||
log "INFO" "Stage: DIAGNOSE ($host - ${service:-all})"
|
log "INFO" "Stage: DIAGNOSE ($host - ${service:-all})"
|
||||||
|
|
||||||
|
if [[ -n "$service" ]]; then
|
||||||
|
emit_event "remediation_started" "warning" "diagnostics.sh" "$service" "${TIMESTAMP}" "{\"reason\": \"failure_detected\"}"
|
||||||
|
fi
|
||||||
|
|
||||||
local diag_file="${LOG_DIR}/diagnostics_${TIMESTAMP}.txt"
|
local diag_file="${LOG_DIR}/diagnostics_${TIMESTAMP}.txt"
|
||||||
{
|
{
|
||||||
echo "--- DIAGNOSTICS FOR ${service:-all} (Host: $host, Time: $(date)) ---"
|
echo "--- DIAGNOSTICS FOR ${service:-all} (Host: $host, Time: $(date)) ---"
|
||||||
|
|
|
||||||
85
scripts/lib/events.py
Normal file
|
|
@ -0,0 +1,85 @@
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import datetime
|
||||||
|
import uuid
|
||||||
|
import socket
|
||||||
|
|
||||||
|
EVENTS_BASE_DIR = os.getenv("RUNTIME_PATH", "/opt/homelab") + "/events"
|
||||||
|
|
||||||
|
def emit_event(event_type, severity, source, service, correlation_id, payload=None):
|
||||||
|
"""
|
||||||
|
Emits a normalized JSON event to the filesystem.
|
||||||
|
"""
|
||||||
|
if payload is None:
|
||||||
|
payload = {}
|
||||||
|
|
||||||
|
node = socket.gethostname()
|
||||||
|
now = datetime.datetime.now(datetime.timezone.utc)
|
||||||
|
timestamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
|
||||||
|
date_dir = now.strftime("%Y-%m-%d")
|
||||||
|
|
||||||
|
event_dir = os.path.join(EVENTS_BASE_DIR, date_dir, node)
|
||||||
|
os.makedirs(event_dir, exist_ok=True)
|
||||||
|
|
||||||
|
event_id = str(uuid.uuid4())
|
||||||
|
filename = f"{timestamp}_{event_type}_{event_id}.json"
|
||||||
|
event_path = os.path.join(event_dir, filename)
|
||||||
|
|
||||||
|
event_data = {
|
||||||
|
"timestamp": timestamp,
|
||||||
|
"node": node,
|
||||||
|
"type": event_type,
|
||||||
|
"severity": severity,
|
||||||
|
"source": source,
|
||||||
|
"service": service,
|
||||||
|
"correlation_id": correlation_id,
|
||||||
|
"payload": payload
|
||||||
|
}
|
||||||
|
|
||||||
|
with open(event_path, "w") as f:
|
||||||
|
json.dump(event_data, f, indent=2)
|
||||||
|
|
||||||
|
return event_path
|
||||||
|
|
||||||
|
def list_events(date_str=None, node=None):
|
||||||
|
"""
|
||||||
|
Lists paths to event files for a specific date and/or node.
|
||||||
|
"""
|
||||||
|
if date_str is None:
|
||||||
|
date_str = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%d")
|
||||||
|
|
||||||
|
search_path = os.path.join(EVENTS_BASE_DIR, date_str)
|
||||||
|
if node:
|
||||||
|
search_path = os.path.join(search_path, node)
|
||||||
|
|
||||||
|
if not os.path.exists(search_path):
|
||||||
|
return []
|
||||||
|
|
||||||
|
event_files = []
|
||||||
|
for root, dirs, files in os.walk(search_path):
|
||||||
|
for file in files:
|
||||||
|
if file.endswith(".json"):
|
||||||
|
event_files.append(os.path.join(root, file))
|
||||||
|
|
||||||
|
return sorted(event_files)
|
||||||
|
|
||||||
|
def get_event(event_path):
|
||||||
|
"""
|
||||||
|
Reads and parses an event file.
|
||||||
|
"""
|
||||||
|
with open(event_path, "r") as f:
|
||||||
|
return json.load(f)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# Simple CLI for emitting events from Python
|
||||||
|
import sys
|
||||||
|
if len(sys.argv) > 1 and sys.argv[1] == "emit":
|
||||||
|
# emit <type> <severity> <source> <service> <cid> [payload_json]
|
||||||
|
etype = sys.argv[2]
|
||||||
|
sev = sys.argv[3]
|
||||||
|
src = sys.argv[4]
|
||||||
|
svc = sys.argv[5]
|
||||||
|
cid = sys.argv[6]
|
||||||
|
payload = json.loads(sys.argv[7]) if len(sys.argv) > 7 else {}
|
||||||
|
path = emit_event(etype, sev, src, svc, cid, payload)
|
||||||
|
print(f"Event emitted: {path}")
|
||||||
79
scripts/lib/events.sh
Normal file
|
|
@ -0,0 +1,79 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# events.sh - Filesystem-first event system for homelab
|
||||||
|
|
||||||
|
EVENTS_BASE_DIR="${RUNTIME_PATH:-/opt/homelab}/events"
|
||||||
|
|
||||||
|
# Emit a normalized JSON event
|
||||||
|
# Usage: emit_event <type> <severity> <source> <service> <correlation_id> <payload_json>
|
||||||
|
emit_event() {
|
||||||
|
local type=$1
|
||||||
|
local severity=$2
|
||||||
|
local source=$3
|
||||||
|
local service=$4
|
||||||
|
local correlation_id=$5
|
||||||
|
local payload=${6:-"{}"}
|
||||||
|
|
||||||
|
local node=$(hostname)
|
||||||
|
local timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
|
||||||
|
local date_dir=$(date -u +"%Y-%m-%d")  # UTC, matching the timestamp and events.py
|
||||||
|
|
||||||
|
local event_dir="${EVENTS_BASE_DIR}/${date_dir}/${node}"
|
||||||
|
mkdir -p "$event_dir"
|
||||||
|
|
||||||
|
# Generate a unique filename for the event to ensure append-only/no-overwrite
|
||||||
|
local event_id=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || date +%s%N)
|
||||||
|
local event_file="${event_dir}/${timestamp}_${type}_${event_id}.json"
|
||||||
|
|
||||||
|
# Construct JSON
|
||||||
|
cat <<EOF > "$event_file"
|
||||||
|
{
|
||||||
|
"timestamp": "$timestamp",
|
||||||
|
"node": "$node",
|
||||||
|
"type": "$type",
|
||||||
|
"severity": "$severity",
|
||||||
|
"source": "$source",
|
||||||
|
"service": "$service",
|
||||||
|
"correlation_id": "$correlation_id",
|
||||||
|
"payload": $payload
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
# Also log to standard logging if available
|
||||||
|
if command -v log >/dev/null 2>&1; then
|
||||||
|
log "EVENT" "[$type] service=$service severity=$severity cid=$correlation_id"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Query recent events (last N events or by date)
|
||||||
|
# Usage: list_events [date] [node]
|
||||||
|
list_events() {
|
||||||
|
local target_date=${1:-$(date -u +"%Y-%m-%d")}  # UTC, same day boundary as emit_event
|
||||||
|
local target_node=$2
|
||||||
|
|
||||||
|
local search_path="${EVENTS_BASE_DIR}/${target_date}"
|
||||||
|
if [[ -n "$target_node" ]]; then
|
||||||
|
search_path="${search_path}/${target_node}"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -d "$search_path" ]]; then
|
||||||
|
find "$search_path" -name "*.json" | sort
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Simple filter helper
|
||||||
|
# Usage: filter_events <field> <value>
|
||||||
|
filter_events() {
|
||||||
|
local field=$1
|
||||||
|
local value=$2
|
||||||
|
local files=$3
|
||||||
|
|
||||||
|
for f in $files; do
|
||||||
|
if grep -q "\"$field\": \"$value\"" "$f"; then
|
||||||
|
echo "$f"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# export -f emit_event
|
||||||
|
# export -f list_events
|
||||||
|
# export -f filter_events
|
||||||
|
|
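Review note on emit_event() in events.sh: the heredoc interpolates `$service`, `$source` and similar values straight into JSON string positions, so a value containing a double quote or a newline would produce a file that does not parse. A minimal sketch of an escaping helper follows, in pure bash so that it adds no dependency; `json_escape` is a hypothetical name, and it covers only backslash, double quote, newline, tab and carriage return, not every control character.

```shell
#!/usr/bin/env bash
# Sketch only: escape a string for use inside a JSON string literal.
# json_escape is a hypothetical helper; other control characters are not handled.
json_escape() {
  local s="$1" bs='\' dq='"'
  s=${s//"$bs"/"$bs$bs"}     # backslash first, so later escapes are not doubled
  s=${s//"$dq"/"$bs$dq"}     # double quote
  s=${s//$'\n'/"${bs}n"}     # newline
  s=${s//$'\t'/"${bs}t"}     # tab
  s=${s//$'\r'/"${bs}r"}     # carriage return
  printf '%s' "$s"
}
```

A caller could then build a payload as `"{\"info\": \"$(json_escape "$info")\"}"` before handing it to emit_event.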
@ -8,6 +8,11 @@ log() {
|
||||||
echo "[$(date +'%Y-%m-%d %H:%M:%S')] [$level] $message"
|
echo "[$(date +'%Y-%m-%d %H:%M:%S')] [$level] $message"
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# --- Load Events Library ---
|
||||||
|
if [[ -f "${LIB_PATH:-$(dirname "${BASH_SOURCE[0]}")}/events.sh" ]]; then
|
||||||
|
source "${LIB_PATH:-$(dirname "${BASH_SOURCE[0]}")}/events.sh"
|
||||||
|
fi
|
||||||
|
|
||||||
# Structured log for machine reading
|
# Structured log for machine reading
|
||||||
# timestamp, stage, host, service, command_result, info
|
# timestamp, stage, host, service, command_result, info
|
||||||
struct_log() {
|
struct_log() {
|
||||||
|
|
@ -17,7 +22,34 @@ struct_log() {
|
||||||
local result=$4
|
local result=$4
|
||||||
local info=$5
|
local info=$5
|
||||||
log "STRUCT" "stage=$stage host=$host service=$service result=$result info=\"$info\""
|
log "STRUCT" "stage=$stage host=$host service=$service result=$result info=\"$info\""
|
||||||
|
|
||||||
|
# Emit event if it matches normalized types
|
||||||
|
local event_type=""
|
||||||
|
local severity="info"
|
||||||
|
|
||||||
|
case "$stage" in
|
||||||
|
"deploy")
|
||||||
|
if [[ "$result" == "success" ]]; then
|
||||||
|
event_type="deployment_completed"
|
||||||
|
elif [[ "$result" == "fail" ]]; then
|
||||||
|
event_type="deployment_failed"
|
||||||
|
severity="error"
|
||||||
|
else
|
||||||
|
event_type="deployment_started"
|
||||||
|
fi
|
||||||
|
;;
|
||||||
|
"validate")
|
||||||
|
if [[ "$result" == "fail" ]]; then
|
||||||
|
event_type="deployment_failed"
|
||||||
|
severity="error"
|
||||||
|
fi
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
if [[ -n "$event_type" ]] && command -v emit_event >/dev/null 2>&1; then
|
||||||
|
emit_event "$event_type" "$severity" "deploy.sh" "$service" "${TIMESTAMP:-$(date +%s)}" "{\"stage\": \"$stage\", \"info\": \"$info\"}"
|
||||||
|
fi
|
||||||
}
|
}
|
||||||
|
|
||||||
export -f log
|
# export -f log
|
||||||
export -f struct_log
|
# export -f struct_log
|
||||||
|
|
|
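Review note on the struct_log() hunk above: the nested `case`/`if` maps a (stage, result) pair to an event type and severity. A minimal sketch of the same mapping as one flat lookup, which may be easier to extend when more stages emit events; `event_for` is a hypothetical name, and the table mirrors only the branches visible in this diff.

```shell
#!/usr/bin/env bash
# Sketch only: print "<event_type> <severity>" for a stage/result pair,
# or nothing when no event applies. event_for is a hypothetical name.
event_for() {
  local stage="$1" result="$2"
  case "${stage}:${result}" in
    deploy:success) echo "deployment_completed info" ;;
    deploy:fail)    echo "deployment_failed error" ;;
    deploy:*)       echo "deployment_started info" ;;
    validate:fail)  echo "deployment_failed error" ;;
  esac
}
```

The caller can split the result with `read -r event_type severity <<< "$(event_for "$stage" "$result")"` and skip emission when `event_type` is empty.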
||||||
338
scripts/monitor/health-monitor.sh
Executable file
|
|
@ -0,0 +1,338 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
# health-monitor.sh - Homelab node health monitor and safe disk cleanup
#
# Designed to run standalone on the host (cron or direct) or to be called by
# the node-agent Python daemon. All cleanup decisions follow the conservative
# policy agreed in the design review:
#
#   lte_node (chelsty-infra, chelsty-ha) : NO cleanup at all
#   sd_card  (piha, saturn)              : dangling images + stopped containers,
#                                          rate-limited to once per 24 h
#   ai_node  (solaria)                   : dangling images + stopped containers
#                                          + build cache (NEVER -a)
#   standard (vps)                       : dangling images + stopped containers
#                                          + build cache
#
# VPS additionally rotates control-plane filesystem artefacts:
#   actions/completed + failed  > 7 days
#   logs/deploy                 > 30 days
#   events/**                   > 3 days AND past observer checkpoint
#
# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
#   actions/pending|approved|running, Frigate recordings, Ollama models,
#   Zigbee2MQTT data, Mosquitto data, HA database/config.

set -euo pipefail

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
EVENTS_DIR="${RUNTIME_PATH}/events"
STATE_DIR="${RUNTIME_PATH}/state"
LOGS_DIR="${RUNTIME_PATH}/logs"
ACTIONS_DIR="${RUNTIME_PATH}/actions"

NODE_NAME="${NODE_NAME:-$(hostname)}"
TIMESTAMP=$(date +%s)
DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Thresholds
DISK_WARN_PCT=75
DISK_CRIT_PCT=85
MEM_WARN_PCT=85
MEM_CRIT_PCT=95

# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
CLEANUP_INTERVAL=86400  # seconds

# Node classifications
LTE_NODES="chelsty-infra chelsty-ha"
SD_CARD_NODES="piha saturn"
AI_NODES="solaria"

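Note: the policy table in the header comment and the node lists above could be mirrored on the Python side (for instance in the node-agent daemon) roughly as follows. This is only a sketch: `NODE_TYPES`, `CLEANUP_POLICY`, `node_type` and `allowed_prunes` are hypothetical names that do not appear in the repository, and the prune targets are labels chosen for illustration.

```python
# Hypothetical mirror of the cleanup policy described in health-monitor.sh.
NODE_TYPES = {
    "lte_node": {"chelsty-infra", "chelsty-ha"},
    "sd_card": {"piha", "saturn"},
    "ai_node": {"solaria"},
}

# Which "docker <target> prune -f" commands each node type may run.
# No entry ever implies "image prune -a".
CLEANUP_POLICY = {
    "lte_node": (),
    "sd_card": ("image", "container"),
    "ai_node": ("image", "container", "builder"),
    "standard": ("image", "container", "builder"),
}


def node_type(node_name: str) -> str:
    """Classify a node; anything not listed is treated as 'standard'."""
    for kind, members in NODE_TYPES.items():
        if node_name in members:
            return kind
    return "standard"


def allowed_prunes(node_name: str) -> tuple:
    """Return the prune targets permitted on this node."""
    return CLEANUP_POLICY[node_type(node_name)]
```

Keeping the policy in one table makes it easier to check that a new node lands in the intended class before any cleanup runs on it.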
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

log() { echo "$(date -u +%H:%M:%S) [INFO] $*"; }
warn() { echo "$(date -u +%H:%M:%S) [WARN] $*" >&2; }
err() { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }

contains() {
    local word="$1"; shift
    for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
    return 1
}

get_node_type() {
    # shellcheck disable=SC2086
    if contains "$NODE_NAME" $LTE_NODES; then echo "lte_node"; return; fi
    if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card"; return; fi
    if contains "$NODE_NAME" $AI_NODES; then echo "ai_node"; return; fi
    echo "standard"
}

# ---------------------------------------------------------------------------
# Event emission
# ---------------------------------------------------------------------------

emit_event() {
    local type="$1" severity="$2" service="${3:-}" message="$4" payload="${5:-}"
    # The default is applied separately: "${5:-{}}" parses as "${5:-{}" followed
    # by a literal "}", which appends a stray brace to any payload passed in.
    [[ -n "${payload}" ]] || payload='{}'
    # Include the service (when set) so that several events of the same type
    # emitted in one run, e.g. one per container, do not overwrite each other.
    local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"
    local dir="${EVENTS_DIR}/${NODE_NAME}"
    mkdir -p "$dir"
    cat > "${dir}/${id}.json" <<EOF
{
  "id": "${id}",
  "timestamp": ${TIMESTAMP},
  "date": "${DATE}",
  "type": "${type}",
  "severity": "${severity}",
  "node": "${NODE_NAME}",
  "service": "${service}",
  "message": "${message}",
  "payload": ${payload}
}
EOF
}

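Note: `emit_event` builds JSON by string interpolation, so a double quote or backslash in a message would produce an invalid file. A Python caller could build the same fields and let `json.dumps` handle the escaping. The sketch below is an illustration under assumptions: `build_event` is a hypothetical helper, and appending the service name to the id is a choice made here so that two events of the same type in the same second do not collide.

```python
import json
import time
from datetime import datetime, timezone


def build_event(node, etype, severity, message, service="", payload=None, now=None):
    """Build an event dict with the fields emit_event writes (hypothetical helper)."""
    ts = int(time.time() if now is None else now)
    event_id = f"evt-{node}-{ts}-{etype}"
    if service:
        event_id += f"-{service}"
    return {
        "id": event_id,
        "timestamp": ts,
        "date": datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "type": etype,
        "severity": severity,
        "node": node,
        "service": service,
        "message": message,
        "payload": {} if payload is None else payload,
    }


def serialize_event(event: dict) -> str:
    """json.dumps escapes quotes and backslashes that a heredoc would not."""
    return json.dumps(event, indent=2)
```

A round trip through `json.loads(serialize_event(evt))` returns the original message unchanged, including embedded quotes.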
# ---------------------------------------------------------------------------
# Health checks
# ---------------------------------------------------------------------------

check_disk() {
    # Use /opt/homelab as the check target: it lives on the host filesystem,
    # so the path is correct both when running natively and in a container
    # that mounts /opt/homelab from the host.
    local mount="${RUNTIME_PATH}"
    local usage_pct avail_mb total_mb
    # One df call instead of three. -P keeps each filesystem on a single line
    # and -k fixes the block size at 1 KiB so the MB conversion is valid.
    read -r total_mb avail_mb usage_pct < <(
        df -Pk "${mount}" 2>/dev/null |
            awk 'NR==2 {gsub(/%/,"",$5); printf "%d %d %s\n", $2/1024, $4/1024, $5}'
    ) || return 1
    # Never echo a non-numeric value: it would end up in the heartbeat JSON.
    [[ "${usage_pct}" =~ ^[0-9]+$ ]] || return 1

    if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
        warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
        emit_event "disk_pressure" "high" "" \
            "Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
            "{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
    elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
        warn "Disk elevated: ${usage_pct}% used"
        emit_event "disk_pressure" "medium" "" \
            "Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
            "{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
    fi
    echo "${usage_pct}"
}

check_memory() {
    local total avail pct avail_mb
    total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
    avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
    pct=$(( (total - avail) * 100 / total ))
    avail_mb=$(( avail / 1024 ))

    if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
        warn "Memory CRITICAL: ${pct}% used"
        emit_event "high_memory" "high" "" \
            "Memory usage critical: ${pct}% (${avail_mb} MB available)" \
            "{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
    elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
        warn "Memory elevated: ${pct}%"
        emit_event "high_memory" "medium" "" \
            "Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
            "{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
    fi
    echo "${pct}"
}

check_cpu() {
    # Two-sample /proc/stat delta for accurate instantaneous CPU usage.
    local idle1 total1 idle2 total2 pct
    read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
    sleep 1
    read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)

    local d_idle=$(( idle2 - idle1 ))
    local d_total=$(( total2 - total1 ))
    pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))

    if [[ "${pct}" -ge 90 ]]; then
        warn "CPU elevated: ${pct}%"
        emit_event "high_cpu" "medium" "" \
            "CPU usage elevated: ${pct}%" \
            "{\"usage_pct\": ${pct}}"
    fi
    echo "${pct}"
}

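Note: the two-sample calculation in `check_cpu` can be expressed as a pure function over two copies of the aggregate `cpu` line from `/proc/stat`, which makes the arithmetic easy to check without sleeping. This is a sketch, not code from the repository; `parse_cpu_line` and `cpu_usage_pct` are hypothetical names.

```python
def parse_cpu_line(line: str):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(x) for x in line.split()[1:]]  # drop the 'cpu' label
    idle = fields[3]                             # user, nice, system, idle, ...
    return idle, sum(fields)


def cpu_usage_pct(before: str, after: str) -> int:
    """Busy percentage between two samples, using integer maths like the script."""
    idle1, total1 = parse_cpu_line(before)
    idle2, total2 = parse_cpu_line(after)
    d_idle = idle2 - idle1
    d_total = total2 - total1
    if d_total <= 0:
        return 0
    return 100 - d_idle * 100 // d_total


sample_a = "cpu  100 0 100 800 0 0 0 0 0 0"
sample_b = "cpu  150 0 150 900 0 0 0 0 0 0"
usage = cpu_usage_pct(sample_a, sample_b)  # 200 jiffies elapsed, 100 of them idle
```

Here `usage` is 50: half of the jiffies that elapsed between the two samples were idle.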
check_containers() {
    # "return 0" rather than a bare "return": a bare return would propagate the
    # failed lookup's status and abort the script under "set -e".
    command -v docker &>/dev/null || return 0

    # Exited containers that belong to a Compose project. The Compose label is
    # used as a proxy for "should be running"; the restart policy itself is
    # not inspected.
    local cname
    while IFS= read -r cname; do
        [[ -z "$cname" ]] && continue
        warn "Container exited (should be running): ${cname}"
        emit_event "containers_not_running" "high" "${cname}" \
            "Container '${cname}' has exited unexpectedly (Compose-managed)" \
            "{\"container\": \"${cname}\"}"
    done < <(docker ps -a \
        --filter "status=exited" \
        --filter "label=com.docker.compose.project" \
        --format "{{.Names}}" 2>/dev/null || true)

    # Containers that are running but their health check is failing
    while IFS= read -r cname; do
        [[ -z "$cname" ]] && continue
        warn "Container unhealthy: ${cname}"
        emit_event "healthcheck_failed" "high" "${cname}" \
            "Container '${cname}' is running but health check is failing" \
            "{\"container\": \"${cname}\"}"
    done < <(docker ps \
        --filter "health=unhealthy" \
        --format "{{.Names}}" 2>/dev/null || true)
}

# ---------------------------------------------------------------------------
# Safe Docker cleanup (per policy)
# ---------------------------------------------------------------------------

_sd_card_rate_ok() {
    if [[ -f "${CLEANUP_LOCK}" ]]; then
        local last_ts elapsed
        last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
        # Treat an empty or corrupt stamp as "never ran" instead of letting
        # the arithmetic below fail.
        [[ "${last_ts}" =~ ^[0-9]+$ ]] || last_ts=0
        elapsed=$(( TIMESTAMP - last_ts ))
        if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
            log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
            return 1
        fi
    fi
    return 0
}

_mark_cleanup_done() {
    echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
}

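Note: the stamp-file rate limit above is small enough to port directly if the node-agent daemon ever needs the same guard. The following is a sketch under assumptions: `rate_ok` and `mark_done` are hypothetical names, and an unreadable or non-numeric stamp is treated as "never ran".

```python
from pathlib import Path

CLEANUP_INTERVAL = 86400  # seconds, same value as in health-monitor.sh


def rate_ok(stamp_file: Path, now: int, interval: int = CLEANUP_INTERVAL) -> bool:
    """True when no cleanup has been recorded within the last `interval` seconds."""
    try:
        last = int(stamp_file.read_text().strip())
    except (OSError, ValueError):
        return True  # missing, unreadable or corrupt stamp: allow the cleanup
    return now - last >= interval


def mark_done(stamp_file: Path, now: int) -> None:
    """Record the time of the cleanup that just ran."""
    stamp_file.write_text(f"{now}\n")
```

As in the shell version, a cleanup is allowed again once exactly `interval` seconds have elapsed.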
run_safe_cleanup() {
    # Explicit "return 0" here and in the sd_card branch: a bare "return" would
    # hand the failing status back to the caller and, under "set -e", abort the
    # script before the node_health heartbeat is emitted.
    command -v docker &>/dev/null || return 0
    local node_type
    node_type=$(get_node_type)

    case "${node_type}" in
        lte_node)
            # NO cleanup on LTE nodes. Any docker operation risks triggering
            # a pull over a metered/intermittent connection.
            log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
            ;;

        sd_card)
            # Dangling images + stopped containers only.
            # Rate-limited to once per 24 hours to protect SD card write endurance.
            _sd_card_rate_ok || return 0
            log "Running rate-limited Docker cleanup (SD card node)"
            docker image prune -f >/dev/null 2>&1 || true
            docker container prune -f >/dev/null 2>&1 || true
            _mark_cleanup_done
            ;;

        ai_node)
            # Dangling images + stopped containers + build cache.
            # NEVER docker image prune -a (would remove Ollama runtime images,
            # requiring a multi-hour re-pull of model weights).
            log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
            docker image prune -f >/dev/null 2>&1 || true
            docker container prune -f >/dev/null 2>&1 || true
            docker builder prune -f >/dev/null 2>&1 || true
            ;;

        standard)
            # VPS and other standard nodes: full safe cleanup.
            log "Running standard Docker cleanup"
            docker image prune -f >/dev/null 2>&1 || true
            docker container prune -f >/dev/null 2>&1 || true
            docker builder prune -f >/dev/null 2>&1 || true
            ;;
    esac
}

# ---------------------------------------------------------------------------
# VPS-specific: control-plane filesystem rotation
# ---------------------------------------------------------------------------

cleanup_control_plane_fs() {
    log "Running control-plane filesystem rotation"

    # Completed / failed actions older than 7 days
    local status dir
    for status in completed failed; do
        dir="${ACTIONS_DIR}/${status}"
        [[ -d "${dir}" ]] || continue
        find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
            log "Cleaned ${status} actions older than 7 days" || true
    done

    # Deploy logs older than 30 days
    local deploy_logs="${LOGS_DIR}/deploy"
    if [[ -d "${deploy_logs}" ]]; then
        find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
            log "Cleaned deploy logs older than 30 days" || true
    fi

    # Event files older than 3 days AND already past the observer checkpoint.
    # The dual condition ensures we never delete an event the observer hasn't seen.
    local checkpoint="${STATE_DIR}/observer_checkpoint.json"
    if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
        # observer.py saves one checkpoint per node directory
        # ({"node_checkpoints": {"<node>": "<last file path>"}}); the legacy
        # "last_processed_file" key is still accepted. The helper prints one
        # "<node><TAB><path>" line per checkpoint. Checkpoint paths are assumed
        # to be absolute paths under EVENTS_DIR, as in the legacy format.
        local node_dir last_processed f found=0
        while IFS=$'\t' read -r node_dir last_processed; do
            [[ -n "${node_dir}" && -n "${last_processed}" ]] || continue
            found=1
            while IFS= read -r f; do
                # Only delete files that sort before the checkpoint path
                # (i.e., the observer has already processed them).
                if [[ "$f" < "${last_processed}" ]]; then
                    rm -f "$f"
                    log "Cleaned old event: $(basename "$f")"
                fi
            done < <(find "${EVENTS_DIR}/${node_dir}" -name "*.json" -mtime +3 2>/dev/null)
        done < <(python3 -c '
import json, os, sys
try:
    d = json.load(open(sys.argv[1]))
except Exception:
    sys.exit(0)
cps = d.get("node_checkpoints") or {}
old = d.get("last_processed_file")
if not cps and old:
    cps = {os.path.relpath(old, sys.argv[2]).split(os.sep)[0]: old}
for node, path in cps.items():
    print(node, path, sep="\t")
' "${checkpoint}" "${EVENTS_DIR}" 2>/dev/null || true)

        if [[ "${found}" -eq 0 ]]; then
            log "No observer checkpoint set; skipping event file cleanup"
        fi
    fi
}

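Note: the dual condition for event rotation (old enough AND already behind the observer checkpoint) can be sketched in Python against the per-node checkpoint mapping that observer.py saves. This is an illustration only: `prunable_events` is a hypothetical helper, it looks one level deep instead of recursing, and its plain "older than N days" cutoff only approximates the whole-day rounding of `find -mtime +3`.

```python
import time
from pathlib import Path


def prunable_events(events_dir: Path, node_checkpoints: dict,
                    max_age_days: int = 3, now=None):
    """Yield event files that are both older than max_age_days and sort
    before the checkpoint path recorded for their node directory."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for node, last_processed in node_checkpoints.items():
        node_dir = events_dir / node
        if not last_processed or not node_dir.is_dir():
            continue  # no checkpoint for this node: delete nothing
        for path in sorted(node_dir.glob("*.json")):
            old_enough = path.stat().st_mtime < cutoff
            already_seen = str(path) < last_processed
            if old_enough and already_seen:
                yield path
```

A node directory without a checkpoint entry is skipped entirely, which keeps the "never delete an unseen event" guarantee even for nodes the observer has not reached yet.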
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"

log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"

disk_pct=$(check_disk || echo 0)
mem_pct=$(check_memory || echo 0)
cpu_pct=$(check_cpu || echo 0)
check_containers

run_safe_cleanup

# VPS: also rotate control-plane filesystem artefacts
if [[ "${NODE_NAME}" == "vps" ]]; then
    cleanup_control_plane_fs
fi

# Emit a node_health heartbeat so the observer can update node status
# and the supervisor can see up-to-date resource metrics.
emit_event "node_health" "info" "" \
    "Health check completed on ${NODE_NAME}" \
    "{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"

log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"

520
scripts/observer/observer.py
Normal file
@@ -0,0 +1,520 @@
import os
import json
import time
import glob
import logging
import yaml
from datetime import datetime, timezone
from pathlib import Path


def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def _parse_ts(ts) -> float:
    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.

    Events from node-agent use int(time.time()); events from stability-agent / events.py
    use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
    last_occurrence and resolved_at, so any arithmetic on them must go through here.
    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
    """
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except Exception:
        return 0.0

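Note: `_parse_ts` replaces a trailing `Z` because `datetime.fromisoformat` only accepts it natively from Python 3.11 onwards. One remaining edge case is an ISO string with no offset at all, which `.timestamp()` interprets in the host's local timezone. If producers can emit such strings, a variant that pins them to UTC might look like the sketch below; `parse_ts_utc` is a hypothetical name, and treating naive timestamps as UTC is an assumption about the producers.

```python
from datetime import datetime, timezone


def parse_ts_utc(ts) -> float:
    """Like _parse_ts, but timezone-less ISO strings are taken to be UTC."""
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        dt = datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
    except ValueError:
        return 0.0
    if dt.tzinfo is None:
        # Assumption: producers that omit the offset mean UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()
```

With this variant, `'1970-01-02T00:00:00Z'` and `'1970-01-02T00:00:00'` map to the same value regardless of the host's timezone setting.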
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
STATE_DIR = Path(RUNTIME_PATH) / "state"
LOGS_DIR = Path(RUNTIME_PATH) / "logs"
WORLD_DIR = Path(RUNTIME_PATH) / "world"
OBSERVER_STATE_FILE = STATE_DIR / "observer_checkpoint.json"

REPO_ROOT = Path(__file__).parent.parent.parent
INVENTORY_TOPOLOGY = REPO_ROOT / "inventory" / "topology.yaml"

# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("observer")

class Observer:
    def __init__(self):
        # Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
        # Replaces the old single last_processed_file which silently skipped event dirs
        # that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
        self.node_checkpoints: dict = {}
        self.world_state = {
            "nodes": {},
            "services": {},
            "deployments": {},
            "incidents": {},
            "summary": {
                "last_update": datetime.now(timezone.utc).isoformat(),
                "status": "initializing",
                "active_incidents_count": 0
            }
        }
        self.inventory = self._load_inventory()
        self._ensure_dirs()
        self._load_checkpoint()

    def _ensure_dirs(self):
        WORLD_DIR.mkdir(parents=True, exist_ok=True)
        STATE_DIR.mkdir(parents=True, exist_ok=True)
        EVENTS_DIR.mkdir(parents=True, exist_ok=True)
        LOGS_DIR.mkdir(parents=True, exist_ok=True)

    def _load_inventory(self):
        inventory = {"nodes": {}, "services": {}}
        try:
            if INVENTORY_TOPOLOGY.exists():
                with open(INVENTORY_TOPOLOGY, "r") as f:
                    # safe_load returns None for an empty file.
                    topo = yaml.safe_load(f) or {}
                    for node_name, node_info in topo.get("nodes", {}).items():
                        inventory["nodes"][node_name] = {
                            "roles": node_info.get("roles", []),
                            "connectivity": node_info.get("connectivity", {})
                        }

            # Load service assignments from hosts files
            hosts_dir = REPO_ROOT / "hosts"
            for host_dir in hosts_dir.iterdir():
                if host_dir.is_dir():
                    svc_file = host_dir / "services.yaml"
                    if svc_file.exists():
                        with open(svc_file, "r") as f:
                            svc_data = yaml.safe_load(f) or {}
                            host_name = svc_data.get("host")
                            for svc_name, svc_info in svc_data.get("services", {}).items():
                                if host_name not in inventory["services"]:
                                    inventory["services"][host_name] = {}
                                inventory["services"][host_name][svc_name] = {
                                    "role": svc_info.get("role"),
                                    "exposure": svc_info.get("exposure")
                                }
        except Exception as e:
            logger.error(f"Failed to load inventory: {e}")
        return inventory

    def _load_checkpoint(self):
        if OBSERVER_STATE_FILE.exists():
            try:
                with open(OBSERVER_STATE_FILE, "r") as f:
                    checkpoint = json.load(f)

                if "node_checkpoints" in checkpoint:
                    # New format: per-directory checkpoints.
                    self.node_checkpoints = checkpoint["node_checkpoints"]
                elif "last_processed_file" in checkpoint:
                    # Migrate old single-file checkpoint: extract node dir from path.
                    old = checkpoint["last_processed_file"]
                    if old:
                        try:
                            node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
                            self.node_checkpoints = {node_dir: old}
                            logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
                        except Exception:
                            pass  # Bad path: start fresh

                self._load_world_from_disk()
            except Exception as e:
                logger.error(f"Failed to load checkpoint: {e}")

    def _load_world_from_disk(self):
        # Optional: Load existing state to resume faster
        files = {
            "nodes": WORLD_DIR / "nodes.json",
            "services": WORLD_DIR / "services.json",
            "deployments": WORLD_DIR / "deployments.json",
            "incidents": WORLD_DIR / "incidents.json",
            "summary": WORLD_DIR / "runtime-summary.json"
        }
        for key, path in files.items():
            if path.exists():
                try:
                    with open(path, "r") as f:
                        self.world_state[key] = json.load(f)
                except Exception as e:
                    logger.error(f"Failed to load {key} state: {e}")

    def _save_checkpoint(self):
        try:
            _atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
        except Exception as e:
            logger.error(f"Failed to save checkpoint: {e}")

    def _prune_stale_world(self):
        """Remove world-state entries for nodes absent from the topology inventory.

        Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
        falls back to socket.gethostname(), which inside a Docker container returns the
        12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
        ('vps'). The observer ingests those events and creates ghost entries that never
        expire on their own.

        Also ages out resolved incidents older than 7 days to keep world state lean.
        """
        known_nodes = set(self.inventory["nodes"].keys())
        if not known_nodes:
            # Inventory failed to load: don't prune, to avoid wiping valid state.
            return

        stale_nodes = [n for n in list(self.world_state["nodes"].keys())
                       if n not in known_nodes]
        for n in stale_nodes:
            logger.info(f"Pruning stale node from world state: {n}")
            del self.world_state["nodes"][n]

        stale_svcs = [k for k in list(self.world_state["services"].keys())
                      if k.split("/")[0] in stale_nodes]
        for k in stale_svcs:
            logger.info(f"Pruning stale service from world state: {k}")
            del self.world_state["services"][k]

        # Prune ghost service keys whose service-name portion is a hash-prefixed
        # Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
        # These are created when node-agent incorrectly uses c.name instead of the
        # compose label, and accumulate on every container rebuild.
        # Pattern: <node>/<12hexchars>_<real-name>
        ghost_svcs = [
            k for k in list(self.world_state["services"].keys())
            if len(k.split("/", 1)) == 2
            and len(k.split("/", 1)[1]) > 13
            and k.split("/", 1)[1][12] == "_"
            and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
        ]
        for k in ghost_svcs:
            logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
            del self.world_state["services"][k]

        now = time.time()

        try:
            # Collect incident_ids currently referenced by any service entry.
            linked_ids: set = {
                svc.get("incident_id")
                for svc in self.world_state["services"].values()
                if svc.get("incident_id")
            }

            # Case 1: service is healthy but still points at an active incident.
            # process_event already calls _resolve_incident on service_healthy events,
            # but if the observer restarted with on-disk state where the link was
            # intact (inconsistency from a pre-atomic-write crash), it may not get
            # resolved until the next service_healthy event is processed. Resolve
            # immediately: a healthy service cannot have an ongoing incident.
            for svc_key, svc in self.world_state["services"].items():
                if svc.get("status") != "healthy":
                    continue
                inc_id = svc.get("incident_id")
                if not inc_id:
                    continue
                inc = self.world_state["incidents"].get(inc_id, {})
                if inc.get("status") == "active":
                    logger.info(
                        f"Auto-resolving incident {inc_id} for {svc_key}: "
                        f"service is healthy"
                    )
                    inc["status"] = "resolved"
                    inc["resolved_at"] = now
                    svc["incident_id"] = None
                    linked_ids.discard(inc_id)

            # Case 2: orphaned active incident. No service entry links to it and
            # last_occurrence is older than 5 minutes (guard against creation races).
            # These are the stale records left behind when on-disk state was
            # inconsistent: the service entry had incident_id cleared but incidents.json
            # still had the record as "active".
            for inc_id, inc in self.world_state["incidents"].items():
                if inc.get("status") != "active":
                    continue
                if inc_id in linked_ids:
                    continue
                age = now - _parse_ts(inc.get("last_occurrence"))
                if age > 300:  # 5-minute guard
                    logger.info(
                        f"Auto-resolving orphaned incident {inc_id} "
                        f"(service={inc.get('service')}, node={inc.get('node')}): "
                        f"no service references it, age={int(age)}s"
                    )
                    inc["status"] = "resolved"
                    inc["resolved_at"] = now

        except Exception as exc:
            logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")

        # Remove resolved incidents older than 7 days.
        # Use _parse_ts so ISO-string resolved_at values are handled correctly.
        stale_incidents = [
            k for k, v in self.world_state["incidents"].items()
            if v.get("status") == "resolved"
            and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
        ]
        for k in stale_incidents:
            del self.world_state["incidents"][k]

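Note: the ghost-key test in `_prune_stale_world` spells out the pattern `<node>/<12hexchars>_<real-name>` with four string checks. A regular expression can state the same pattern in one place. This is a sketch of an alternative, not the author's code; `GHOST_KEY` and `is_ghost_service_key` are hypothetical names, and unlike the original checks the pattern requires a non-empty node part.

```python
import re

# <node>/<12 lowercase hex chars>_<real-name>
GHOST_KEY = re.compile(r"[^/]+/[0-9a-f]{12}_.+")


def is_ghost_service_key(key: str) -> bool:
    """True for service keys whose name part carries a container-ID prefix."""
    return GHOST_KEY.fullmatch(key) is not None


ghost_examples = [
    k for k in (
        "vps/9e36297651e7_control-plane-observer",
        "vps/control-plane-observer",
    )
    if is_ghost_service_key(k)
]
```

Only the first key in the example list is classified as a ghost; the plain compose service name passes through untouched.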
    def _save_world(self):
        self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
        active_incidents = [
            k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
        ]
        self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
        self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
        self.world_state["summary"]["service_count"] = len(self.world_state["services"])

        if active_incidents:
            self.world_state["summary"]["status"] = "degraded"
        else:
            self.world_state["summary"]["status"] = "nominal"

        files = {
            "nodes.json": self.world_state["nodes"],
            "services.json": self.world_state["services"],
            "deployments.json": self.world_state["deployments"],
            "incidents.json": self.world_state["incidents"],
            "recommendations.json": [],
            "runtime-summary.json": self.world_state["summary"]
        }
        for filename, data in files.items():
            try:
                _atomic_write_json(WORLD_DIR / filename, data)
            except Exception as e:
                logger.error(f"Failed to save {filename}: {e}")

    def process_event(self, event):
        # "or ''" keeps the startswith() check below safe when "type" is missing.
        etype = event.get("type") or ""
        node = event.get("node")
        service = event.get("service")
        severity = event.get("severity")
        timestamp = event.get("timestamp")
        cid = event.get("correlation_id")
        # "or {}" also covers an explicit null payload.
        payload = event.get("payload") or {}

        # 1. Update Node State
        if node not in self.world_state["nodes"]:
            self.world_state["nodes"][node] = {
                "status": "unknown",
                "last_seen": None,
                "roles": self.inventory["nodes"].get(node, {}).get("roles", [])
            }
        self.world_state["nodes"][node]["last_seen"] = timestamp

        if etype == "node_online":
            self.world_state["nodes"][node]["status"] = "online"
        elif etype == "node_offline":
            self.world_state["nodes"][node]["status"] = "offline"

        elif etype == "node_health":
            # Regular heartbeat from node-agent; updates resource metrics.
            # Clears disk_pressure if disk is now healthy (< warn threshold;
            # 75 mirrors DISK_WARN_PCT in health-monitor.sh).
            self.world_state["nodes"][node]["status"] = "online"
            self.world_state["nodes"][node].update({
                "disk_usage_pct": payload.get("disk_pct"),
                "mem_usage_pct": payload.get("mem_pct"),
                "cpu_usage_pct": payload.get("cpu_pct"),
            })
            if (payload.get("disk_pct") or 0) < 75:
                self.world_state["nodes"][node].pop("disk_pressure", None)

        elif etype == "disk_pressure":
            # Emitted when disk usage crosses 75 % (medium) or 85 % (high).
            # The supervisor reads disk_pressure to generate disk_cleanup actions.
            self.world_state["nodes"][node]["disk_pressure"] = severity
            self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")

        elif etype == "high_memory":
            # Memory pressure observation; recorded on the node for correlation.
            # No automated action: the operator decides if a container restart helps.
            self.world_state["nodes"][node]["memory_pressure"] = severity
            self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")

        elif etype == "high_cpu":
            # CPU pressure observation; recorded for visibility.
            self.world_state["nodes"][node]["cpu_pressure"] = severity
            self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")

        # 2. Update Service State
        if service and service != "all":
            svc_key = f"{node}/{service}"
            if svc_key not in self.world_state["services"]:
                self.world_state["services"][svc_key] = {
                    "node": node,
                    "service": service,
                    "status": "unknown",
                    "last_check": None,
                    "incident_id": None
                }
            self.world_state["services"][svc_key]["last_check"] = timestamp

            if etype == "service_recovered":
                self.world_state["services"][svc_key]["status"] = "healthy"
                self._resolve_incident(svc_key, timestamp)
            elif etype == "service_healthy":
                # Positive confirmation from node-agent that a managed container
                # is running. This keeps services.json populated so the supervisor
                # can correctly detect drift (absent entry = never reported = unknown,
                # not the same as confirmed missing).
                # Also resolve any active incident: if a service that had been
                # unhealthy/crashing is now confirmed healthy, the incident is over.
                self.world_state["services"][svc_key]["status"] = "healthy"
                self._resolve_incident(svc_key, timestamp)
            elif etype in ["service_unhealthy", "healthcheck_failed"]:
                self.world_state["services"][svc_key]["status"] = "unhealthy"
                self._handle_incident(svc_key, event)

        # 3. Update Deployment State
        if etype.startswith("deployment_") and cid:
            if cid not in self.world_state["deployments"]:
                self.world_state["deployments"][cid] = {
                    "node": node,
                    "service": service,
                    "status": "unknown",
                    "started_at": None,
                    "finished_at": None,
                    "events": []
                }
            self.world_state["deployments"][cid]["events"].append({
                "type": etype,
                "timestamp": timestamp,
                "payload": payload
            })
            if etype == "deployment_started":
                self.world_state["deployments"][cid]["status"] = "in_progress"
                self.world_state["deployments"][cid]["started_at"] = timestamp
            elif etype == "deployment_completed":
                self.world_state["deployments"][cid]["status"] = "completed"
                self.world_state["deployments"][cid]["finished_at"] = timestamp
            elif etype == "deployment_failed":
                self.world_state["deployments"][cid]["status"] = "failed"
                self.world_state["deployments"][cid]["finished_at"] = timestamp
                # Deployment failure often creates an incident
                self._handle_deployment_failure(event)

    def _handle_incident(self, svc_key, event):
        # Correlation: collapse repeated failures for the same service on the same node
        active_incident = self.world_state["services"][svc_key].get("incident_id")

        if active_incident and active_incident in self.world_state["incidents"]:
            incident = self.world_state["incidents"][active_incident]
            if incident["status"] == "active":
                incident["last_occurrence"] = event["timestamp"]
                incident["occurrence_count"] = incident.get("occurrence_count", 1) + 1
                incident["events"].append(event["timestamp"])
                return

        # Create new incident
        incident_id = f"inc-{int(time.time())}-{event.get('node')}-{event.get('service')}"
        self.world_state["incidents"][incident_id] = {
            "id": incident_id,
            "node": event.get("node"),
            "service": event.get("service"),
            "status": "active",
            "severity": event.get("severity"),
            # trigger_type records the event type that opened this incident so that
            # the supervisor can choose the appropriate remediation action
            # (e.g. container_restart for containers_not_running / mqtt_unreachable
            # vs. a full redeploy for other causes).
            "trigger_type": event.get("type"),
            "started_at": event.get("timestamp"),
            "last_occurrence": event.get("timestamp"),
            "occurrence_count": 1,
            "events": [event["timestamp"]],
            "correlation_id": event.get("correlation_id")
        }
        self.world_state["services"][svc_key]["incident_id"] = incident_id

def _resolve_incident(self, svc_key, timestamp):
|
||||||
|
incident_id = self.world_state["services"][svc_key].get("incident_id")
|
||||||
|
if incident_id and incident_id in self.world_state["incidents"]:
|
||||||
|
if self.world_state["incidents"][incident_id]["status"] == "active":
|
||||||
|
self.world_state["incidents"][incident_id]["status"] = "resolved"
|
||||||
|
self.world_state["incidents"][incident_id]["resolved_at"] = timestamp
|
||||||
|
self.world_state["services"][svc_key]["incident_id"] = None
|
||||||
|
|
||||||
|
def _handle_deployment_failure(self, event):
|
||||||
|
# Specific logic for deployment failures
|
||||||
|
svc_key = f"{event.get('node')}/{event.get('service')}"
|
||||||
|
self._handle_incident(svc_key, event)
|
||||||
|
|
||||||
|
# Link diagnostics if available in payload
|
||||||
|
incident_id = self.world_state["services"][svc_key].get("incident_id")
|
||||||
|
if incident_id and incident_id in self.world_state["incidents"]:
|
||||||
|
payload = event.get("payload", {})
|
||||||
|
if "diagnostics_file" in payload:
|
||||||
|
self.world_state["incidents"][incident_id]["diagnostics_ref"] = payload["diagnostics_file"]
|
||||||
|
elif "error" in payload:
|
||||||
|
self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
|
||||||
|
|
||||||
|
def run_once(self):
|
||||||
|
# Update heartbeat
|
||||||
|
heartbeat_file = STATE_DIR / "observer.heartbeat"
|
||||||
|
try:
|
||||||
|
heartbeat_file.touch()
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to touch heartbeat file: {e}")
|
||||||
|
|
||||||
|
# Collect all event files grouped by node directory.
|
||||||
|
# Per-node checkpoints are compared within each directory independently,
|
||||||
|
# so late-arriving events from remote nodes (sorted earlier in the path)
|
||||||
|
# are never skipped just because another node's checkpoint is further ahead.
|
||||||
|
all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
|
||||||
|
|
||||||
|
new_files = []
|
||||||
|
for file_path in all_files:
|
||||||
|
try:
|
||||||
|
node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
|
||||||
|
except (IndexError, ValueError):
|
||||||
|
node_dir = "__unknown__"
|
||||||
|
last_for_node = self.node_checkpoints.get(node_dir, "")
|
||||||
|
if file_path > last_for_node:
|
||||||
|
new_files.append((node_dir, file_path))
|
||||||
|
|
||||||
|
if not new_files:
|
||||||
|
# Even if no new events, prune stale entries and refresh summary freshness.
|
||||||
|
self._prune_stale_world()
|
||||||
|
self._save_world()
|
||||||
|
return
|
||||||
|
|
||||||
|
logger.info(f"Processing {len(new_files)} new events across "
|
||||||
|
f"{len({n for n, _ in new_files})} node(s)")
|
||||||
|
for node_dir, file_path in new_files:
|
||||||
|
try:
|
||||||
|
with open(file_path, "r") as f:
|
||||||
|
event = json.load(f)
|
||||||
|
self.process_event(event)
|
||||||
|
# Advance per-node checkpoint (only forward — no regression).
|
||||||
|
if file_path > self.node_checkpoints.get(node_dir, ""):
|
||||||
|
self.node_checkpoints[node_dir] = file_path
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Error processing {file_path}: {e}")
|
||||||
|
|
||||||
|
self._save_checkpoint()
|
||||||
|
self._prune_stale_world()
|
||||||
|
self._save_world()
|
||||||
|
|
||||||
|
def loop(self, interval=5):
|
||||||
|
logger.info("Starting observer loop")
|
||||||
|
while True:
|
||||||
|
self.run_once()
|
||||||
|
time.sleep(interval)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
import sys
|
||||||
|
observer = Observer()
|
||||||
|
if "--run-once" in sys.argv:
|
||||||
|
observer.run_once()
|
||||||
|
else:
|
||||||
|
observer.loop()
|
||||||
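The per-node checkpoint filter in `run_once` above can be sketched in isolation as follows. This is a minimal illustration, not the observer's actual code: the helper name `select_new_files` and the `events/<node>/<file>` layout are assumptions made for the example.

```python
from pathlib import PurePosixPath


def select_new_files(all_files, events_dir, node_checkpoints):
    """Return (node_dir, file_path) pairs newer than that node's checkpoint.

    Hypothetical helper: assumes the first path component below events_dir
    identifies the node, as the run_once comments describe.
    """
    new_files = []
    for file_path in sorted(all_files):
        try:
            node_dir = PurePosixPath(file_path).relative_to(events_dir).parts[0]
        except (IndexError, ValueError):
            node_dir = "__unknown__"
        # Compare only against this node's own checkpoint, never a global one.
        if file_path > node_checkpoints.get(node_dir, ""):
            new_files.append((node_dir, file_path))
    return new_files


files = [
    "/tmp/events/saturn/120000_y.json",
    "/tmp/events/alpha/130000_x.json",  # late arrival from another node
]
checkpoints = {"saturn": "/tmp/events/saturn/120000_y.json"}
pending = select_new_files(files, "/tmp/events", checkpoints)
```

With a single global checkpoint set to the `saturn` path, the `alpha` file would compare as older (its path sorts earlier) and be skipped; with per-node checkpoints it is still selected.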
83
scripts/observer/test_setup.sh
Normal file
|
|
@ -0,0 +1,83 @@
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
mkdir -p /tmp/homelab/events/2026-05-12/saturn
|
||||||
|
mkdir -p /tmp/homelab/state
|
||||||
|
mkdir -p /tmp/homelab/logs
|
||||||
|
mkdir -p /tmp/homelab/world
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120000_node_online_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:00:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "node_online",
|
||||||
|
"severity": "info",
|
||||||
|
"source": "system",
|
||||||
|
"service": "all",
|
||||||
|
"correlation_id": "init",
|
||||||
|
"payload": {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120500_service_unhealthy_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:05:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "service_unhealthy",
|
||||||
|
"severity": "error",
|
||||||
|
"source": "healthcheck",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "hc-1",
|
||||||
|
"payload": {"error": "connection refused"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/120600_service_unhealthy_2.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:06:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "service_unhealthy",
|
||||||
|
"severity": "error",
|
||||||
|
"source": "healthcheck",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "hc-2",
|
||||||
|
"payload": {"error": "connection refused"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121000_service_recovered_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:10:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "service_recovered",
|
||||||
|
"severity": "info",
|
||||||
|
"source": "healthcheck",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "hc-3",
|
||||||
|
"payload": {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121500_deployment_started_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:15:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "deployment_started",
|
||||||
|
"severity": "info",
|
||||||
|
"source": "deploy_agent",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "deploy-1",
|
||||||
|
"payload": {"version": "2.0.18"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
cat <<EOF > /tmp/homelab/events/2026-05-12/saturn/121600_deployment_failed_1.json
|
||||||
|
{
|
||||||
|
"timestamp": "2026-05-12T12:16:00Z",
|
||||||
|
"node": "saturn",
|
||||||
|
"type": "deployment_failed",
|
||||||
|
"severity": "error",
|
||||||
|
"source": "deploy_agent",
|
||||||
|
"service": "mosquitto",
|
||||||
|
"correlation_id": "deploy-1",
|
||||||
|
"payload": {"error": "container crash", "diagnostics_file": "/opt/homelab/logs/diagnostics-deploy-1.log"}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
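The `mosquitto` health fixtures above exercise the observer's incident correlation: two consecutive failures should collapse into one incident, and the recovery should resolve it. The sketch below replays that sequence with a simplified stand-in for `_handle_incident` / `_resolve_incident`. It assumes `service_recovered` is one of the event types that marks a service healthy (that branch condition is not shown in this part of the diff), and it uses a hypothetical counter-based incident id instead of the wall-clock id the observer generates.

```python
def replay(events):
    """Replay health events and return the resulting incidents keyed by id."""
    incidents = {}
    active = {}  # svc_key -> id of the incident currently open for that service
    for event in events:
        svc_key = f"{event['node']}/{event['service']}"
        etype = event["type"]
        if etype in ("service_unhealthy", "healthcheck_failed"):
            incident_id = active.get(svc_key)
            if incident_id is not None:
                # Repeated failure: fold it into the open incident.
                incident = incidents[incident_id]
                incident["occurrence_count"] += 1
                incident["last_occurrence"] = event["timestamp"]
                continue
            # Hypothetical id scheme, used here only to keep the example deterministic.
            incident_id = f"inc-{len(incidents) + 1}-{event['node']}-{event['service']}"
            incidents[incident_id] = {
                "status": "active",
                "trigger_type": etype,
                "started_at": event["timestamp"],
                "last_occurrence": event["timestamp"],
                "occurrence_count": 1,
            }
            active[svc_key] = incident_id
        elif etype == "service_recovered":
            incident_id = active.pop(svc_key, None)
            if incident_id is not None:
                incidents[incident_id]["status"] = "resolved"
                incidents[incident_id]["resolved_at"] = event["timestamp"]
    return incidents


fixture = [
    {"timestamp": "2026-05-12T12:05:00Z", "node": "saturn",
     "service": "mosquitto", "type": "service_unhealthy"},
    {"timestamp": "2026-05-12T12:06:00Z", "node": "saturn",
     "service": "mosquitto", "type": "service_unhealthy"},
    {"timestamp": "2026-05-12T12:10:00Z", "node": "saturn",
     "service": "mosquitto", "type": "service_recovered"},
]
result = replay(fixture)
```

Under these assumptions the replay yields a single resolved incident with an occurrence count of 2.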
55
services/agent-system/README.md
Normal file
|
|
@ -0,0 +1,55 @@
|
||||||
|
### Agent System
|
||||||
|
Central runtime materializer and Operator Control Plane UI.
|
||||||
|
|
||||||
|
#### Components
|
||||||
|
- **Redis**: Central state store (on PIHA).
|
||||||
|
- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
|
||||||
|
- **Web UI**: Exposes API endpoints and serves the Operator UI.
|
||||||
|
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.
|
||||||
|
|
||||||
|
#### Configuration
|
||||||
|
Environment variables should be set in `.env` (see `env.example`).
|
||||||
|
Key variables for the Telegram Bot:
|
||||||
|
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
|
||||||
|
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
|
||||||
|
- `CONTROL_PLANE_URL`: URL to the `agent-system-webui` (default: `http://webui:8080`).
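
As an illustration of the `TELEGRAM_ALLOWED_USER_IDS` format, the sketch below parses a comma-separated value into integer IDs in the same spirit as the bot's startup code. The function name `parse_allowed_ids` is hypothetical.

```python
def parse_allowed_ids(raw):
    """Parse a comma-separated list of Telegram user IDs, ignoring blank entries."""
    return [int(part.strip()) for part in raw.split(",") if part.strip()]


allowed = parse_allowed_ids("12345678, 87654321,")
```

A non-numeric entry raises `ValueError`, so a typo in `.env` would surface at startup rather than silently widening or narrowing access.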
|
||||||
|
|
||||||
|
#### Telegram Commands
|
||||||
|
- `/status`: Check bot and API connectivity.
|
||||||
|
- `/summary`: System health overview.
|
||||||
|
- `/nodes`: List homelab nodes and their status.
|
||||||
|
- `/services`: Summary of services across nodes.
|
||||||
|
- `/unhealthy`: List all unhealthy components.
|
||||||
|
- `/incidents`: View active incidents.
|
||||||
|
- `/actions`: Summary of operator actions.
|
||||||
|
- `/help`: List all commands.
|
||||||
|
|
||||||
|
#### Deployment (on PIHA)
|
||||||
|
```bash
|
||||||
|
cd services/agent-system
|
||||||
|
./deploy.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Deployment (on CHELSTY)
|
||||||
|
```bash
|
||||||
|
cd services/stability-agent
|
||||||
|
docker compose up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Verification
|
||||||
|
The `deploy.sh` script automatically verifies the local endpoints.
|
||||||
|
You can also manually check:
|
||||||
|
```bash
|
||||||
|
# Check runtime summary
|
||||||
|
curl http://localhost:18180/summary
|
||||||
|
|
||||||
|
# Check discovered nodes
|
||||||
|
curl http://localhost:18180/nodes
|
||||||
|
|
||||||
|
# Check discovered services
|
||||||
|
curl http://localhost:18180/services
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Directory Structure
|
||||||
|
- `/opt/homelab/world`: Contains materialized JSON state.
|
||||||
|
- `/opt/homelab/state`: Contains operator configuration and local heartbeats.
|
||||||
52
services/agent-system/action-model.md
Normal file
|
|
@ -0,0 +1,52 @@
|
||||||
|
### Action Approval Data Model
|
||||||
|
|
||||||
|
Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.
|
||||||
|
|
||||||
|
#### Statuses
|
||||||
|
- `pending`: Waiting for operator approval. AI agents create actions in this state.
|
||||||
|
- `approved`: Approved by operator, ready for execution.
|
||||||
|
- `rejected`: Rejected by operator, will not be executed.
|
||||||
|
- `running`: Currently being executed by an agent (e.g. `materializer`).
|
||||||
|
- `completed`: Successfully executed.
|
||||||
|
- `failed`: Execution failed.
|
||||||
|
|
||||||
|
#### Human-in-the-Loop (HIL) Protocol
|
||||||
|
1. **Request**: Agent identifies a required change and writes a JSON file to `actions/pending/`.
|
||||||
|
2. **Notification**: System notifies the human operator.
|
||||||
|
3. **Audit**: Human reviews `details.reason` and `details.diff`.
|
||||||
|
4. **Authorization**: Human moves file to `approved/`.
|
||||||
|
5. **Execution**: Agent monitors `approved/` and executes the task.
|
||||||
|
|
||||||
|
#### Schema
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"action_id": "string",
|
||||||
|
"service": "string",
|
||||||
|
"node": "string",
|
||||||
|
"type": "deploy_service | restart_service | rollback | scale",
|
||||||
|
"risk": "nominal | guarded | critical",
|
||||||
|
"status": "pending | approved | rejected | ...",
|
||||||
|
"created_at": <unix_seconds>,
|
||||||
|
"updated_at": <unix_seconds>,
|
||||||
|
"details": {
|
||||||
|
"image": "string",
|
||||||
|
"reason": "string",
|
||||||
|
"diff": "string"
|
||||||
|
},
|
||||||
|
"transition_history": [
|
||||||
|
{
|
||||||
|
"from": "string | null",
|
||||||
|
"to": "string",
|
||||||
|
"timestamp": <unix_seconds>,
|
||||||
|
"by": "string (system | operator-tg-12345 | webui)"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
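
A minimal validation sketch for the schema above, assuming the enumerations listed in this document are exhaustive. The names `validate_action`, `REQUIRED_FIELDS`, and the `ALLOWED_*` sets are hypothetical and not part of the repository.

```python
REQUIRED_FIELDS = ("action_id", "service", "node", "type", "risk",
                   "status", "created_at", "updated_at")
ALLOWED_TYPES = {"deploy_service", "restart_service", "rollback", "scale"}
ALLOWED_RISKS = {"nominal", "guarded", "critical"}
ALLOWED_STATUSES = {"pending", "approved", "rejected",
                    "running", "completed", "failed"}


def validate_action(action):
    """Return a list of problems; an empty list means these checks passed."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in action]
    checks = (("type", ALLOWED_TYPES),
              ("risk", ALLOWED_RISKS),
              ("status", ALLOWED_STATUSES))
    for field, allowed in checks:
        if field in action and action[field] not in allowed:
            problems.append(f"invalid {field}: {action[field]!r}")
    # Timestamps are documented as unix seconds.
    for field in ("created_at", "updated_at"):
        if field in action and not isinstance(action[field], int):
            problems.append(f"{field} must be integer unix seconds")
    return problems
```

This covers only the schema as documented here; producers that write a `risk_level` field instead of `risk` would need an extra rule.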
|
||||||
|
|
||||||
|
#### Workflow
|
||||||
|
1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
|
||||||
|
2. `telegram-bot` detects the file, sends a message to allowed users.
|
||||||
|
3. Operator clicks "Approve" or "Reject".
|
||||||
|
4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
|
||||||
|
5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
|
||||||
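The status moves in the workflow above could be sketched as follows. This is one possible approach under stated assumptions, not the bot's or agent's actual implementation: the helper name `transition` is hypothetical, and it writes the updated document to a temporary file before renaming it into the destination directory so that a watcher never reads a half-written file.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def transition(actions_root, action_id, src, dst, by):
    """Move an action from status `src` to `dst`, recording the transition."""
    root = Path(actions_root)
    src_path = root / src / f"{action_id}.json"
    dst_dir = root / dst
    dst_dir.mkdir(parents=True, exist_ok=True)

    action = json.loads(src_path.read_text())
    now = int(time.time())
    action["status"] = dst
    action["updated_at"] = now
    action.setdefault("transition_history", []).append(
        {"from": src, "to": dst, "timestamp": now, "by": by}
    )

    # Write to a temp file in the destination directory, then rename it
    # into place; os.replace is atomic when both paths share a filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(action, f, indent=2)
    os.replace(tmp_path, dst_dir / f"{action_id}.json")

    # The action briefly exists in both directories until this unlink.
    src_path.unlink()
    return action
```

For example, `transition("/opt/homelab/actions", "test-123", "pending", "approved", "operator-tg-12345")` would approve a pending action; the action id and operator label here are placeholders.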
28
services/agent-system/deploy.sh
Executable file
|
|
@ -0,0 +1,28 @@
|
||||||
|
#!/bin/bash
|
||||||
|
set -e
|
||||||
|
|
||||||
|
echo ">>> Validating docker-compose configuration..."
|
||||||
|
docker compose config
|
||||||
|
|
||||||
|
echo ">>> Building and starting Agent System services..."
|
||||||
|
docker compose up -d --build
|
||||||
|
|
||||||
|
echo ">>> Services status:"
|
||||||
|
docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
|
||||||
|
|
||||||
|
if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
|
||||||
|
echo ">>> Telegram bot status: DISABLED (token missing)"
|
||||||
|
else
|
||||||
|
echo ">>> Telegram bot status: ENABLED"
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo ">>> Verifying API endpoints..."
|
||||||
|
sleep 5 # Give it a moment to start
|
||||||
|
|
||||||
|
endpoints=("summary" "nodes" "services")
|
||||||
|
for ep in "${endpoints[@]}"; do
|
||||||
|
echo "Checking /$ep..."
|
||||||
|
curl -s -f "http://localhost:18180/$ep" > /dev/null && echo " OK" || echo " FAILED"
|
||||||
|
done
|
||||||
|
|
||||||
|
echo ">>> Deployment complete."
|
||||||
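The endpoint check in `deploy.sh` waits a fixed five seconds before probing. As an alternative sketch (not part of the repository), the check could poll each endpoint until it answers. The function name `wait_for_endpoint` and the injectable `opener` and `sleep` parameters are assumptions introduced so the logic can be exercised without a running service.

```python
import time
import urllib.error
import urllib.request


def wait_for_endpoint(url, attempts=10, delay=1.0,
                      opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll `url` until it answers with HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # urllib.error.URLError is a subclass of OSError, so connection
            # refusals and HTTP error statuses both land here.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Usage would mirror the shell loop: call `wait_for_endpoint("http://localhost:18180/summary")` for each of `summary`, `nodes`, and `services`, and report `OK` or `FAILED` from the return value.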
47
services/agent-system/docker-compose.yml
Normal file
|
|
@ -0,0 +1,47 @@
|
||||||
|
services:
|
||||||
|
redis:
|
||||||
|
image: redis:7
|
||||||
|
container_name: agent-system-redis
|
||||||
|
ports:
|
||||||
|
- "6379:6379"
|
||||||
|
restart: unless-stopped
|
||||||
|
|
||||||
|
webui:
|
||||||
|
build: ./webui
|
||||||
|
container_name: agent-system-webui
|
||||||
|
ports:
|
||||||
|
- "18180:8080"
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
depends_on:
|
||||||
|
- redis
|
||||||
|
restart: unless-stopped
|
||||||
|
|
||||||
|
runtime-materializer:
|
||||||
|
build: ./runtime-materializer
|
||||||
|
container_name: agent-system-runtime-materializer
|
||||||
|
environment:
|
||||||
|
REDIS_HOST: redis
|
||||||
|
REDIS_PORT: "6379"
|
||||||
|
HOMELAB_WORLD_ROOT: /opt/homelab/world
|
||||||
|
WORLD_DIR: /opt/homelab/world
|
||||||
|
MATERIALIZE_INTERVAL: "10"
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
depends_on:
|
||||||
|
- redis
|
||||||
|
restart: unless-stopped
|
||||||
|
|
||||||
|
telegram-bot:
|
||||||
|
build: ./telegram-bot
|
||||||
|
container_name: agent-system-telegram-bot
|
||||||
|
environment:
|
||||||
|
TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
|
||||||
|
TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
|
||||||
|
CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
|
||||||
|
ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
|
||||||
|
OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
|
||||||
|
ACTIONS_ROOT: /opt/homelab/actions
|
||||||
|
volumes:
|
||||||
|
- /opt/homelab:/opt/homelab
|
||||||
|
restart: on-failure
|
||||||
19
services/agent-system/env.example
Normal file
|
|
@ -0,0 +1,19 @@
|
||||||
|
# Telegram Bot Configuration
|
||||||
|
# Get token from @BotFather
|
||||||
|
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
|
||||||
|
# Comma-separated list of Telegram User IDs
|
||||||
|
TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
|
||||||
|
# Local control-plane API (default is internal compose address)
|
||||||
|
CONTROL_PLANE_URL=http://webui:8080
|
||||||
|
# Optional LLM fallback logic
|
||||||
|
ENABLE_LLM_FALLBACK=false
|
||||||
|
OPENCLAW_BASE_URL=http://openclaw.internal
|
||||||
|
|
||||||
|
# Runtime Materializer Configuration
|
||||||
|
REDIS_HOST=100.108.208.3
|
||||||
|
REDIS_PORT=6379
|
||||||
|
|
||||||
|
# Paths
|
||||||
|
HOMELAB_ROOT=/opt/homelab
|
||||||
|
ACTIONS_ROOT=/opt/homelab/actions
|
||||||
|
WORLD_DIR=/opt/homelab/world
|
||||||
16
services/agent-system/runtime-materializer/Dockerfile
Normal file
|
|
@ -0,0 +1,16 @@
|
||||||
|
FROM python:3.11-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
# Install the Redis Python client
|
||||||
|
RUN pip install --no-cache-dir redis
|
||||||
|
|
||||||
|
COPY materializer.py .
|
||||||
|
|
||||||
|
# Ensure the world directory exists in the container (though it will likely be a volume)
|
||||||
|
RUN mkdir -p /opt/homelab/world
|
||||||
|
|
||||||
|
# Use unbuffered output to see logs in docker
|
||||||
|
ENV PYTHONUNBUFFERED=1
|
||||||
|
|
||||||
|
CMD ["python", "materializer.py"]
|
||||||
251
services/agent-system/runtime-materializer/materializer.py
Normal file
|
|
@ -0,0 +1,251 @@
|
||||||
|
import redis
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import time
|
||||||
|
import argparse
|
||||||
|
import urllib.request
|
||||||
|
import urllib.error
|
||||||
|
from datetime import datetime
|
||||||
|
|
||||||
|
# Configuration from environment variables
|
||||||
|
REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
|
||||||
|
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
|
||||||
|
WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")
|
||||||
|
|
||||||
|
# When set, materialize from the control-plane HTTP API instead of Redis.
|
||||||
|
# This is the authoritative source of truth: the observer writes clean world
|
||||||
|
# state to the control-plane API, which the materializer mirrors locally so
|
||||||
|
# the webui's /snapshot (and all other endpoints) reflect the same data.
|
||||||
|
#
|
||||||
|
# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
|
||||||
|
CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")
|
||||||
|
|
||||||
|
|
||||||
|
def get_redis_client():
|
||||||
|
"""Returns a Redis client with decoding enabled."""
|
||||||
|
return redis.Redis(
|
||||||
|
host=REDIS_HOST,
|
||||||
|
port=REDIS_PORT,
|
||||||
|
decode_responses=True,
|
||||||
|
socket_timeout=5
|
||||||
|
)
|
||||||
|
|
||||||
|
def safe_json_loads(data, default=None):
|
||||||
|
"""Safely loads JSON from a string."""
|
||||||
|
if not data:
|
||||||
|
return default
|
||||||
|
try:
|
||||||
|
if isinstance(data, (dict, list)):
|
||||||
|
return data
|
||||||
|
return json.loads(data)
|
||||||
|
except (json.JSONDecodeError, TypeError):
|
||||||
|
return data
|
||||||
|
|
||||||
|
def normalize_health(health):
|
||||||
|
"""Normalizes health values for the UI."""
|
||||||
|
if not health:
|
||||||
|
return "nominal"
|
||||||
|
h = str(health).lower()
|
||||||
|
if h in ["healthy", "ok", "running", "nominal"]:
|
||||||
|
return "nominal"
|
||||||
|
if h in ["degraded", "warning"]:
|
||||||
|
return "degraded"
|
||||||
|
return "error"
|
||||||
|
|
||||||
|
|
||||||
|
def _fetch_json(url):
|
||||||
|
"""Fetch JSON from a URL, returning parsed data or None on error."""
|
||||||
|
try:
|
||||||
|
with urllib.request.urlopen(url, timeout=10) as resp:
|
||||||
|
return json.loads(resp.read())
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def write_json(filename, data):
|
||||||
|
path = os.path.join(WORLD_DIR, filename)
|
||||||
|
with open(path, "w") as f:
|
||||||
|
json.dump(data, f, indent=2)
|
||||||
|
|
||||||
|
|
||||||
|
def materialize_from_api():
|
||||||
|
"""Mirror world state from the control-plane API to local world files.
|
||||||
|
|
||||||
|
The control-plane observer on VPS is the single authoritative writer of
|
||||||
|
world state. By fetching from its HTTP API we get the same clean, pruned
|
||||||
|
data that the /summary endpoint serves — no stale Redis artefacts.
|
||||||
|
|
||||||
|
Returns True if all fetches succeeded and files were written, False otherwise.
|
||||||
|
"""
|
||||||
|
print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")
|
||||||
|
|
||||||
|
endpoints = {
|
||||||
|
"nodes.json": f"{CONTROL_PLANE_URL}/nodes",
|
||||||
|
"services.json": f"{CONTROL_PLANE_URL}/services",
|
||||||
|
"incidents.json": f"{CONTROL_PLANE_URL}/incidents",
|
||||||
|
"deployments.json": f"{CONTROL_PLANE_URL}/deployments",
|
||||||
|
"recommendations.json": f"{CONTROL_PLANE_URL}/recommendations",
|
||||||
|
"runtime-summary.json": f"{CONTROL_PLANE_URL}/summary",
|
||||||
|
"events.json": f"{CONTROL_PLANE_URL}/events",
|
||||||
|
}
|
||||||
|
|
||||||
|
fetched = {}
|
||||||
|
for filename, url in endpoints.items():
|
||||||
|
data = _fetch_json(url)
|
||||||
|
if data is None:
|
||||||
|
print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
|
||||||
|
return False
|
||||||
|
fetched[filename] = data
|
||||||
|
|
||||||
|
os.makedirs(WORLD_DIR, exist_ok=True)
|
||||||
|
for filename, data in fetched.items():
|
||||||
|
write_json(filename, data)
|
||||||
|
|
||||||
|
svc_count = len(fetched.get("services.json") or [])
|
||||||
|
print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def materialize():
|
||||||
|
"""Reads state from Redis and writes JSON files to the world directory."""
|
||||||
|
print(f"[{datetime.now().isoformat()}] Materializing world state...")
|
||||||
|
try:
|
||||||
|
r = get_redis_client()
|
||||||
|
|
||||||
|
# 1. Nodes
|
||||||
|
nodes = []
|
||||||
|
node_keys = r.keys("homelab:nodes:*")
|
||||||
|
for key in node_keys:
|
||||||
|
node_data = r.hgetall(key)
|
||||||
|
if node_data:
|
||||||
|
# Normalize health
|
||||||
|
if "health" in node_data:
|
||||||
|
node_data["health"] = normalize_health(node_data["health"])
|
||||||
|
# Parse JSON fields if they exist
|
||||||
|
if "capabilities" in node_data:
|
||||||
|
node_data["capabilities"] = safe_json_loads(node_data["capabilities"], [])
|
||||||
|
if "checks" in node_data:
|
||||||
|
node_data["checks"] = safe_json_loads(node_data["checks"], {})
|
||||||
|
nodes.append(node_data)
|
||||||
|
|
||||||
|
# 2. Services
|
||||||
|
services = []
|
||||||
|
service_keys = r.keys("homelab:services:*")
|
||||||
|
for key in service_keys:
|
||||||
|
svc_data = r.hgetall(key)
|
||||||
|
if svc_data:
|
||||||
|
# Normalize health
|
||||||
|
if "health" in svc_data:
|
||||||
|
svc_data["health"] = normalize_health(svc_data["health"])
|
||||||
|
if "dependencies" in svc_data:
|
||||||
|
svc_data["dependencies"] = safe_json_loads(svc_data["dependencies"], [])
|
||||||
|
if "recommendations" in svc_data:
|
||||||
|
svc_data["recommendations"] = safe_json_loads(svc_data["recommendations"], [])
|
||||||
|
services.append(svc_data)
|
||||||
|
|
||||||
|
# 3. Events (Stream)
|
||||||
|
events = []
|
||||||
|
try:
|
||||||
|
# Get last 100 events from the stream
|
||||||
|
raw_events = r.xrevrange("homelab:events", count=100)
|
||||||
|
for event_id, data in raw_events:
|
||||||
|
event = data.copy()
|
||||||
|
event["id"] = event_id
|
||||||
|
if "details" in event:
|
||||||
|
event["details"] = safe_json_loads(event["details"], {})
|
||||||
|
events.append(event)
|
||||||
|
except redis.exceptions.ResponseError:
|
||||||
|
# homelab:events might not be a stream or doesn't exist
|
||||||
|
pass
|
||||||
|
|
||||||
|
# 4. Incidents (Hash)
|
||||||
|
incidents = []
|
||||||
|
incident_keys = r.keys("homelab:incidents:*")
|
||||||
|
for key in incident_keys:
|
||||||
|
incident_data = r.hgetall(key)
|
||||||
|
if incident_data:
|
||||||
|
# Normalize health if present
|
||||||
|
if "health" in incident_data:
|
||||||
|
incident_data["health"] = normalize_health(incident_data["health"])
|
||||||
|
incidents.append(incident_data)
|
||||||
|
|
||||||
|
# 5. Deployments (Hash)
|
||||||
|
deployments = []
|
||||||
|
deployment_keys = r.keys("homelab:deployments:*")
|
||||||
|
for key in deployment_keys:
|
||||||
|
dep_data = r.hgetall(key)
|
||||||
|
if dep_data:
|
||||||
|
deployments.append(dep_data)
|
||||||
|
|
||||||
|
# 6. Recommendations (Hash)
|
||||||
|
recommendations = []
|
||||||
|
recommendation_keys = r.keys("homelab:recommendations:*")
|
||||||
|
for key in recommendation_keys:
|
||||||
|
rec_data = r.hgetall(key)
|
||||||
|
if rec_data:
|
||||||
|
recommendations.append(rec_data)
|
||||||
|
|
||||||
|
# 7. Runtime Summary
|
||||||
|
unhealthy_services = [s for s in services if s.get("health") != "nominal"]
|
||||||
|
active_incidents = [i for i in incidents if i.get("status") not in ["resolved", "closed"]]
|
||||||
|
|
||||||
|
status = "nominal"
|
||||||
|
if len(active_incidents) > 0 or len(unhealthy_services) > 5:
|
||||||
|
status = "error"
|
||||||
|
elif len(unhealthy_services) > 0:
|
||||||
|
status = "degraded"
|
||||||
|
|
||||||
|
summary = {
|
||||||
|
"status": status,
|
||||||
|
"timestamp": datetime.utcnow().isoformat() + "Z",
|
||||||
|
"last_update": int(time.time()),
|
||||||
|
"node_count": len(nodes),
|
||||||
|
"service_count": len(services),
|
||||||
|
"active_incidents_count": len(active_incidents),
|
||||||
|
"unhealthy_services_count": len(unhealthy_services),
|
||||||
|
"incident_count": len(incidents),
|
||||||
|
"recent_events_count": len(events),
|
||||||
|
"stale": False
|
||||||
|
}
|
||||||
|
|
||||||
|
# Ensure directory exists
|
||||||
|
os.makedirs(WORLD_DIR, exist_ok=True)
|
||||||
|
|
||||||
|
write_json("runtime-summary.json", summary)
|
||||||
|
write_json("nodes.json", nodes)
|
||||||
|
write_json("services.json", services)
|
||||||
|
write_json("incidents.json", incidents)
|
||||||
|
write_json("events.json", events)
|
||||||
|
write_json("deployments.json", deployments)
|
||||||
|
write_json("recommendations.json", recommendations)
|
||||||
|
|
||||||
|
print(f"[{datetime.now().isoformat()}] Successfully materialized to {WORLD_DIR}")
|
||||||
|
|
||||||
|
except redis.exceptions.ConnectionError as e:
|
||||||
|
print(f"Redis connection error: {e}")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Unexpected error during materialization: {e}")
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
parser = argparse.ArgumentParser(description="Homelab Runtime Materializer")
|
||||||
|
parser.add_argument("--once", action="store_true", help="Run once and exit")
|
||||||
|
parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if CONTROL_PLANE_URL:
|
||||||
|
print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
|
||||||
|
run_fn = materialize_from_api
|
||||||
|
else:
|
||||||
|
print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
|
||||||
|
run_fn = materialize
|
||||||
|
|
||||||
|
interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))
|
||||||
|
|
||||||
|
if args.once:
|
||||||
|
run_fn()
|
||||||
|
else:
|
||||||
|
print(f"Starting materializer loop (interval: {interval}s)...")
|
||||||
|
while True:
|
||||||
|
run_fn()
|
||||||
|
time.sleep(interval)
|
||||||
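The overall status rule in the Redis-mode `materialize()` above can be isolated as a small pure function, which makes the thresholds easier to read and test. The function name `derive_status` is hypothetical; the thresholds are the ones used in the summary step.

```python
def derive_status(services, incidents):
    """Derive the runtime summary status from normalized services and incidents."""
    # A service without a "health" field counts as unhealthy here,
    # because .get() returns None, which is not "nominal".
    unhealthy = [s for s in services if s.get("health") != "nominal"]
    active = [i for i in incidents
              if i.get("status") not in ("resolved", "closed")]
    if active or len(unhealthy) > 5:
        return "error"
    if unhealthy:
        return "degraded"
    return "nominal"
```

Any active incident escalates straight to `error`, while unhealthy services alone only do so once more than five are affected.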
39
services/agent-system/scripts/create-test-action.sh
Executable file
|
|
@ -0,0 +1,39 @@
|
||||||
|
#!/bin/bash
|
||||||
|
# Script to create a test pending action for Telegram bot verification.
|
||||||
|
|
||||||
|
ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
|
||||||
|
mkdir -p "$ACTIONS_PENDING_DIR"
|
||||||
|
|
||||||
|
ACTION_ID="test-$(date +%s)"
|
||||||
|
FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"
|
||||||
|
|
||||||
|
TIMESTAMP=$(date +%s)
|
||||||
|
|
||||||
|
cat <<EOF > "$FILE_PATH"
|
||||||
|
{
|
||||||
|
"action_id": "$ACTION_ID",
|
||||||
|
"service": "frigate",
|
||||||
|
"node": "chelsty",
|
||||||
|
"type": "deploy_service",
|
||||||
|
"risk": "guarded",
|
||||||
|
"status": "pending",
|
||||||
|
"created_at": $TIMESTAMP,
|
||||||
|
"updated_at": $TIMESTAMP,
|
||||||
|
"details": {
|
||||||
|
"image": "blakeblackshear/frigate:0.13.0",
|
||||||
|
"reason": "Security update for Frigate",
|
||||||
|
"diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
|
||||||
|
},
|
||||||
|
"transition_history": [
|
||||||
|
{
|
||||||
|
"from": null,
|
||||||
|
"to": "pending",
|
||||||
|
"timestamp": $TIMESTAMP,
|
||||||
|
"by": "system-test"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
|
||||||
|
echo "Test action created: $FILE_PATH"
|
||||||
|
echo "If the telegram-bot is running and configured, you should receive a notification."
|
||||||
10
services/agent-system/telegram-bot/Dockerfile
Normal file
|
|
@ -0,0 +1,10 @@
|
||||||
|
FROM python:3.11-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
COPY requirements.txt .
|
||||||
|
RUN pip install --no-cache-dir -r requirements.txt
|
||||||
|
|
||||||
|
COPY bot.py .
|
||||||
|
|
||||||
|
CMD ["python", "bot.py"]
|
||||||
454
services/agent-system/telegram-bot/bot.py
Normal file
|
|
@ -0,0 +1,454 @@
|
||||||
|
import os
|
||||||
|
import json
|
||||||
|
import time
|
||||||
|
import asyncio
|
||||||
|
import logging
|
||||||
|
import urllib.request
|
||||||
|
import urllib.error
|
||||||
|
from pathlib import Path
|
||||||
|
from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
|
||||||
|
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
|
||||||
|
|
||||||
|
# Setup logging
|
||||||
|
logging.basicConfig(
|
||||||
|
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
|
||||||
|
level=logging.INFO
|
||||||
|
)
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
|
||||||
|
ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
|
||||||
|
ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
|
||||||
|
CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
|
||||||
|
ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
|
||||||
|
OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
|
||||||
|
|
||||||
async def fetch_api(path):
    """Helper to fetch JSON from the Control Plane API."""
    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
    try:
        def do_request():
            req = urllib.request.Request(url)
            with urllib.request.urlopen(req, timeout=5) as response:
                if response.status != 200:
                    return None
                return json.loads(response.read().decode())
        return await asyncio.to_thread(do_request)
    except Exception as e:
        logger.error(f"Error fetching {url}: {e}")
        return None

async def post_api(path, data):
    """Helper to POST JSON to the Control Plane API."""
    url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
    try:
        body = json.dumps(data).encode("utf-8")
        def do_request():
            req = urllib.request.Request(url, data=body, method="POST")
            req.add_header("Content-Type", "application/json")
            with urllib.request.urlopen(req, timeout=5) as response:
                return response.status == 200
        return await asyncio.to_thread(do_request)
    except Exception as e:
        logger.error(f"Error posting to {url}: {e}")
        return False

def _format_pending_action(action_id: str, data: dict) -> str:
    """Build the Telegram Markdown message for a pending action notification.

    Extracted so it can be unit-tested without a live Telegram connection.
    """
    # Supervisor writes risk_level; action-model.md legacy schema used risk.
    risk = data.get("risk_level") or data.get("risk", "unknown")
    message = (
        f"⚠️ *Pending Action*\n"
        f"ID: `{action_id}`\n"
        f"Type: `{data.get('type', 'unknown')}`\n"
        f"Service: `{data.get('service', 'unknown')}`\n"
        f"Node: `{data.get('node', 'unknown')}`\n"
        f"Risk: *{risk}*\n"
    )
    # description carries the human-readable substance of the action (required for
    # alert_only actions where it is the entire operator-visible message).
    description = data.get("description", "")
    if description:
        truncated = description[:300] + ("..." if len(description) > 300 else "")
        message += f"Description: `{truncated}`\n"
    # Legacy details block (old action-model.md schema) — kept for backwards compat.
    if "details" in data:
        details_str = json.dumps(data["details"], indent=2)
        if len(details_str) > 1000:
            details_str = details_str[:1000] + "..."
        message += f"\nDetails:\n```json\n{details_str}\n```"
    return message


class ApprovalBot:
    def __init__(self):
        self.pending_dir = ACTIONS_ROOT / "pending"
        self.approved_dir = ACTIONS_ROOT / "approved"
        self.rejected_dir = ACTIONS_ROOT / "rejected"
        # Track which action IDs we have already notified in this session to avoid spam
        self.notified_actions = set()

    async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
        """Job that periodically checks for new pending action files."""
        if not self.pending_dir.exists():
            return

        try:
            for action_file in self.pending_dir.glob("*.json"):
                action_id = action_file.stem
                if action_id in self.notified_actions:
                    continue

                try:
                    data = json.loads(action_file.read_text())
                    # Only notify if it's truly pending
                    if data.get("status") == "pending":
                        await self.notify_users(context, action_id, data)
                        self.notified_actions.add(action_id)
                except Exception as e:
                    logger.error(f"Error processing action file {action_file}: {e}")
        except Exception as e:
            logger.error(f"Error scanning pending directory: {e}")

    async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
        """Sends an approval request message to all allowed users."""
        message = _format_pending_action(action_id, data)

        keyboard = [
            [
                InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
                InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
            ]
        ]
        reply_markup = InlineKeyboardMarkup(keyboard)

        for user_id in ALLOWED_IDS:
            try:
                await context.bot.send_message(
                    chat_id=user_id,
                    text=message,
                    parse_mode="Markdown",
                    reply_markup=reply_markup
                )
                logger.info(f"Notified user {user_id} about action {action_id}")
            except Exception as e:
                logger.error(f"Failed to notify user {user_id}: {e}")

    async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
        """Handles button clicks for Approve/Reject."""
        query = update.callback_query
        user_id = query.from_user.id

        if user_id not in ALLOWED_IDS:
            await query.answer("Unauthorized", show_alert=True)
            return

        await query.answer()

        cb_data = query.data
        if not cb_data or not cb_data.startswith(("approve:", "reject:")):
            return

        action, action_id = cb_data.split(":", 1)
        target_status = "approved" if action == "approve" else "rejected"

        # Use API for mutation if available, fallback to local disk move
        success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
        msg = "Success" if success else "API call failed"

        if not success:
            # Fallback to direct disk manipulation (original behavior)
            success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))

        if success:
            status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
            await query.edit_message_text(
                text=query.message.text + f"\n\n{status_text} by {query.from_user.first_name}",
                # No parse_mode: message.text is plain text, so stray "_" or "*" must not be parsed.
            )
            # Remove from notified list as it's no longer pending
            if action_id in self.notified_actions:
                self.notified_actions.remove(action_id)
        else:
            await query.message.reply_text(f"Failed to process action {action_id}: {msg}")

    def move_action(self, action_id, target_status, user_id, username):
        """Moves action file and updates its status and history."""
        source_path = self.pending_dir / f"{action_id}.json"
        if not source_path.exists():
            return False, "Action file no longer exists in pending."

        target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
        target_dir.mkdir(parents=True, exist_ok=True)
        target_path = target_dir / f"{action_id}.json"

        try:
            data = json.loads(source_path.read_text())
            current_status = data.get("status", "pending")

            # Update data
            data["status"] = target_status
            data["updated_at"] = time.time()

            history = data.get("transition_history", [])
            history.append({
                "from": current_status,
                "to": target_status,
                "timestamp": time.time(),
                "by": f"tg:{username}"
            })
            data["transition_history"] = history

            # Not atomic: write to the new location first, then delete the old file
            target_path.write_text(json.dumps(data, indent=2))
            source_path.unlink()
            logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
            return True, "Success"
        except Exception as e:
            logger.error(f"Error moving action file: {e}")
            return False, str(e)

async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Simple start command to help users find their ID."""
    user = update.effective_user
    message = (
        f"Hello {user.first_name}! 🤖\n"
        f"Your Telegram User ID is: `{user.id}`\n\n"
    )
    if user.id in ALLOWED_IDS:
        message += "✅ You are authorized to manage the homelab.\n\n"
        message += "Use /help to see available commands."
    else:
        message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."

    await update.message.reply_text(message, parse_mode="Markdown")

async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    res = await fetch_api("/summary")
    status = "✅ Online" if res else "❌ Unreachable"
    message = (
        f"🤖 *Telegram Bot Status*\n"
        f"Control Plane API: {status}\n"
        f"Target URL: `{CONTROL_PLANE_URL}`\n"
    )
    await update.message.reply_text(message, parse_mode="Markdown")

async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    data = await fetch_api("/summary")
    if not data:
        await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
        return

    msg = "📊 *System Summary*\n"
    msg += f"Status: `{data.get('status', 'unknown')}`\n"
    msg += f"Nodes: {data.get('node_count', 0)}\n"
    msg += f"Services: {data.get('service_count', 0)}\n"
    msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
    if data.get('stale'):
        msg += "\n⚠️ *Warning: Data is stale!*"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    nodes = await fetch_api("/nodes")
    if nodes is None:
        await update.message.reply_text("❌ Failed to fetch nodes.")
        return

    if not nodes:
        await update.message.reply_text("No nodes discovered in the fleet.")
        return

    msg = "🖥️ *Nodes Status*\n"
    for node in nodes:
        health_icon = "✅" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else "❌"
        msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
        msg += f" Last seen: {node.get('last_seen', 'N/A')}\n"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    services = await fetch_api("/services")
    if services is None:
        await update.message.reply_text("❌ Failed to fetch services.")
        return

    # Summarize by node
    nodes = {}
    for s in services:
        node = s.get("node", "unknown")
        if node not in nodes: nodes[node] = []
        nodes[node].append(s)

    msg = "⚙️ *Services Summary*\n"
    if not nodes:
        msg += "No services discovered."
    else:
        for node, svc_list in sorted(nodes.items()):
            nominal = len([s for s in svc_list if s.get("health") == "nominal"])
            msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"

    msg += "\nUse /unhealthy to see issues."
    await update.message.reply_text(msg, parse_mode="Markdown")

async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    services = await fetch_api("/services")
    nodes = await fetch_api("/nodes")

    msg = "⚠️ *Unhealthy Components*\n"
    found = False

    if services:
        for s in services:
            health = s.get("health", "").lower()
            if health != "nominal":
                msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
                found = True

    if nodes:
        for n in nodes:
            checks = n.get("checks", {})
            if isinstance(checks, str):
                try: checks = json.loads(checks)
                except ValueError: checks = {}  # malformed JSON only; do not swallow other errors

            docker = checks.get("docker", {})
            if docker.get("status") == "ok":
                for c in docker.get("containers", []):
                    if c.get("state") != "running":
                        msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
                        found = True

    if not found:
        msg += "All systems nominal. ✅"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    incidents = await fetch_api("/incidents")
    if incidents is None:
        await update.message.reply_text("❌ Failed to fetch incidents.")
        return

    active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
    if not active:
        await update.message.reply_text("No active incidents. ✅")
        return

    msg = "🚨 *Active Incidents*\n"
    for inc in active:
        severity = inc.get('severity', 'info').upper()
        msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"

    await update.message.reply_text(msg, parse_mode="Markdown")

async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if update.effective_user.id not in ALLOWED_IDS: return
    actions = await fetch_api("/actions")
    if actions is None:
        await update.message.reply_text("❌ Actions endpoint unavailable.")
        return

    msg = "⚡ *Actions Summary*\n"
    total = 0
    for status, act_list in actions.items():
        if act_list:
            msg += f"• {status.capitalize()}: {len(act_list)}\n"
            total += len(act_list)

    if total == 0:
        msg = "No actions recorded."

    await update.message.reply_text(msg, parse_mode="Markdown")

async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
    msg = (
        "📖 *Supported Commands*\n\n"
        "/status - Check bot and API connectivity\n"
        "/summary - System health overview\n"
        "/nodes - List homelab nodes and their status\n"
        "/services - Summary of services across nodes\n"
        "/unhealthy - List all unhealthy components\n"
        "/incidents - View active incidents\n"
        "/actions - Summary of operator actions\n"
        "/help - Show this help message\n\n"
        "Free text will be handled by the guidance system."
    )
    await update.message.reply_text(msg, parse_mode="Markdown")

async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Handles non-command messages."""
    if update.effective_user.id not in ALLOWED_IDS: return

    if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
        # Placeholder for OpenClaw LLM fallback
        # In a real scenario, this would call the LLM API
        logger.info(f"LLM fallback requested for: {update.message.text}")

    await update.message.reply_text(
        "Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
    )

async def run_bot():
    if not TOKEN:
        print("CRITICAL: TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
        # Requirement: "do not fail if Telegram token is absent, but telegram-bot should be disabled or exit cleanly"
        return

    bot_logic = ApprovalBot()

    application = ApplicationBuilder().token(TOKEN).build()

    application.add_handler(CommandHandler("start", start_command))
    application.add_handler(CommandHandler("status", status_command))
    application.add_handler(CommandHandler("summary", summary_command))
    application.add_handler(CommandHandler("nodes", nodes_command))
    application.add_handler(CommandHandler("services", services_command))
    application.add_handler(CommandHandler("unhealthy", unhealthy_command))
    application.add_handler(CommandHandler("incidents", incidents_command))
    application.add_handler(CommandHandler("actions", actions_command))
    application.add_handler(CommandHandler("help", help_command))

    application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
    application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))

    # Schedule the pending actions check
    job_queue = application.job_queue
    if job_queue:
        job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
    else:
        logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")

    logger.info("Starting Telegram Approval Bot...")
    await application.initialize()
    await application.start()
    await application.updater.start_polling()

    # Run until the application is stopped
    stop_event = asyncio.Event()
    try:
        await stop_event.wait()
    except (KeyboardInterrupt, SystemExit):
        logger.info("Stopping bot...")
    finally:
        await application.updater.stop()  # the updater must be stopped before the application
        await application.stop()
        await application.shutdown()

if __name__ == "__main__":
    try:
        asyncio.run(run_bot())
    except KeyboardInterrupt:
        pass
    except Exception as e:
        logger.error(f"Fatal error: {e}")
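The disk fallback in `move_action` (bot.py above) writes the updated JSON into the target directory and only then unlinks the pending file, so a crash in between can leave a half-written target or both copies on disk. The following is a minimal sketch of a safer variant, not code from this repository: the helper name `move_json_atomically` and its parameters are hypothetical, and it assumes the temporary file and the target sit on the same filesystem, which is what makes `os.replace` an atomic rename.

```python
import json
import os
import tempfile
from pathlib import Path


def move_json_atomically(source: Path, target: Path, data: dict) -> None:
    """Write `data` to `target` atomically, then remove `source`.

    Hypothetical helper (not part of bot.py). Assumes the temp file and
    `target` share a filesystem so os.replace() is an atomic rename.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the destination directory first, so readers
    # never observe a half-written target file.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic within one filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    # Drop the pending copy only once the target is safely in place.
    source.unlink(missing_ok=True)
```

Because the temporary file ends in `.tmp`, a `glob("*.json")` scan of the directory, like the one in `check_pending_actions`, would not pick it up while the write is in progress.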
1
services/agent-system/telegram-bot/requirements.txt
Normal file
@@ -0,0 +1 @@
python-telegram-bot[job-queue]==20.7
38
services/agent-system/telegram-bot/tests/conftest.py
Normal file
@@ -0,0 +1,38 @@
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
from __future__ import annotations

import sys
import types
from unittest.mock import MagicMock


def _make_telegram_stub() -> types.ModuleType:
    mod = types.ModuleType("telegram")
    mod.Update = MagicMock
    mod.InlineKeyboardButton = MagicMock
    mod.InlineKeyboardMarkup = MagicMock
    return mod


def _make_telegram_ext_stub() -> types.ModuleType:
    mod = types.ModuleType("telegram.ext")
    mod.ApplicationBuilder = MagicMock

    # ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
    # evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
    ContextTypesMock = MagicMock()
    ContextTypesMock.DEFAULT_TYPE = type(None)
    mod.ContextTypes = ContextTypesMock

    mod.CommandHandler = MagicMock
    mod.CallbackQueryHandler = MagicMock
    mod.MessageHandler = MagicMock
    mod.filters = MagicMock()
    return mod


# Insert before any import of bot.py
if "telegram" not in sys.modules:
    sys.modules["telegram"] = _make_telegram_stub()
if "telegram.ext" not in sys.modules:
    sys.modules["telegram.ext"] = _make_telegram_ext_stub()
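conftest.py above relies on the fact that Python's import system consults `sys.modules` before searching for a package on disk. A small self-contained sketch of that mechanism follows; the package name `fake_sdk` and the helper `install_stub` are made up for this illustration and do not appear in the repository.

```python
import importlib
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a placeholder module under `name` unless one is already loaded."""
    if name in sys.modules:
        return sys.modules[name]
    mod = types.ModuleType(name)
    for attr, value in attrs.items():
        setattr(mod, attr, value)
    sys.modules[name] = mod
    return mod


# `fake_sdk` is not installed anywhere; the import below succeeds only
# because the stub was registered in sys.modules first.
install_stub("fake_sdk", Client=MagicMock)
fake_sdk = importlib.import_module("fake_sdk")
client = fake_sdk.Client()
```

The `if name in sys.modules` guard mirrors the conftest: when the real package is already imported, the stub is not installed over it.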
116
services/agent-system/telegram-bot/tests/test_format.py
Normal file
@@ -0,0 +1,116 @@
"""Tests for _format_pending_action — no Telegram connection required.

telegram stubs are set up in conftest.py before this module is imported.
"""
from __future__ import annotations

import sys
from pathlib import Path

import pytest

sys.path.insert(0, str(Path(__file__).parent.parent))
from bot import _format_pending_action


# ---------------------------------------------------------------------------
# Bug 1 — risk_level field
# ---------------------------------------------------------------------------

def test_risk_level_shown_when_present():
    data = {
        "type": "container_restart", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "low",
    }
    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
    assert "Risk: *low*" in msg
    assert "unknown" not in msg


def test_risk_falls_back_to_legacy_risk_key():
    data = {
        "type": "redeploy", "service": "mosquitto",
        "node": "chelsty-infra", "risk": "guarded",
    }
    msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
    assert "Risk: *guarded*" in msg


def test_risk_unknown_when_both_absent():
    data = {"type": "redeploy", "service": "foo", "node": "bar"}
    msg = _format_pending_action("redeploy-bar-foo", data)
    assert "Risk: *unknown*" in msg


# ---------------------------------------------------------------------------
# Bug 2 — description field
# ---------------------------------------------------------------------------

def test_description_shown_for_alert_only():
    data = {
        "type": "alert_only", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "info",
        "description": "3 entities unavailable for >1h",
    }
    msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
    assert "3 entities unavailable for >1h" in msg
    assert "Description:" in msg


def test_description_shown_for_container_restart():
    data = {
        "type": "container_restart", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "low",
        "description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
    }
    msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
    assert "HA WebSocket unresponsive" in msg


def test_description_absent_no_crash():
    data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
    msg = _format_pending_action("redeploy-bar-foo", data)
    assert "Description:" not in msg
    assert "Risk: *guarded*" in msg


def test_description_truncated_at_300_chars():
    long_desc = "x" * 400
    data = {
        "type": "alert_only", "service": "homeassistant",
        "node": "chelsty-ha", "risk_level": "info",
        "description": long_desc,
    }
    msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
    assert "x" * 300 in msg
    assert "..." in msg
    assert "x" * 301 not in msg


# ---------------------------------------------------------------------------
# Combined — real HA alert_only action shape
# ---------------------------------------------------------------------------

def test_ha_alert_only_full_action():
    """Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
    data = {
        "action_id": "alert-ha-entity-unavailable-chelsty-ha",
        "type": "alert_only",
        "node": "chelsty-ha",
        "service": "homeassistant",
        "risk_level": "info",
        "confidence": 1.0,
        "description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
        "status": "pending",
        "payload": {
            "location_tag": "chelsty",
            "reason": "ha_entity_unavailable_long",
            "count": 3,
        },
    }
    msg = _format_pending_action(data["action_id"], data)
    assert "alert_only" in msg
    assert "chelsty-ha" in msg
    assert "Risk: *info*" in msg
    assert "3 entities unavailable" in msg
    assert "unknown" not in msg
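The truncation rule exercised by `test_description_truncated_at_300_chars` above (keep the first 300 characters, append `...` only when something was cut) can be isolated in a small helper. This is a sketch under that reading of the test; the name `truncate` and its `limit` and `marker` parameters are hypothetical and are not defined in bot.py, which inlines the expression instead.

```python
def truncate(text: str, limit: int = 300, marker: str = "...") -> str:
    """Return `text` cut to `limit` characters, with `marker` appended only if cut."""
    if len(text) <= limit:
        return text
    return text[:limit] + marker
```

A string of exactly `limit` characters is returned unchanged, matching the `len(description) > 300` condition in `_format_pending_action`.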
7
services/agent-system/webui/Dockerfile
Normal file
@@ -0,0 +1,7 @@
FROM python:3.11-slim

WORKDIR /app
COPY web.py index.html ./

EXPOSE 8080
CMD ["python", "web.py"]
769
services/agent-system/webui/index.html
Normal file
@@ -0,0 +1,769 @@
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Operator Control Plane</title>
  <style>
    :root {
      --bg-color: #0a0c0e;
      --sidebar-color: #14171a;
      --card-color: #1c2024;
      --border-color: #2a3540;
      --text-color: #e7edf3;
      --text-muted: #94a3b8;
      --accent-color: #3eaf7c;
      --nominal: #3eaf7c;
      --degraded: #e7c000;
      --unstable: #e67e22;
      --reconciling: #3498db;
      --error: #c0392b;
      --safe: #3eaf7c;
      --guarded: #e67e22;
      --dangerous: #c0392b;
    }

    body {
      margin: 0;
      font-family: 'Inter', system-ui, -apple-system, sans-serif;
      background: var(--bg-color);
      color: var(--text-color);
      display: flex;
      height: 100vh;
      overflow: hidden;
    }

    /* Sidebar */
    .sidebar {
      width: 240px;
      background: var(--sidebar-color);
      border-right: 1px solid var(--border-color);
      display: flex;
      flex-direction: column;
      flex-shrink: 0;
    }

    .sidebar-header {
      padding: 24px;
      font-weight: 800;
      font-size: 14px;
      letter-spacing: 0.1em;
      color: var(--accent-color);
      border-bottom: 1px solid var(--border-color);
    }

    .nav-list {
      list-style: none;
      padding: 12px 0;
      margin: 0;
      flex-grow: 1;
    }

    .nav-item {
      padding: 12px 24px;
      cursor: pointer;
      font-size: 14px;
      color: var(--text-muted);
      transition: all 0.2s;
      display: flex;
      align-items: center;
      gap: 12px;
    }

    .nav-item:hover {
      background: rgba(255, 255, 255, 0.05);
      color: var(--text-color);
    }

    .nav-item.active {
      background: rgba(62, 175, 124, 0.1);
      color: var(--accent-color);
      border-left: 3px solid var(--accent-color);
    }

    .sidebar-footer {
      padding: 16px;
      border-top: 1px solid var(--border-color);
      font-size: 12px;
    }

    /* Content Area */
    .main-content {
      flex-grow: 1;
      display: flex;
      flex-direction: column;
      overflow: hidden;
    }

    header {
      height: 64px;
      border-bottom: 1px solid var(--border-color);
      display: flex;
      align-items: center;
      padding: 0 24px;
      justify-content: space-between;
      background: var(--bg-color);
    }

    .view-title {
      font-size: 18px;
      font-weight: 600;
    }

    .content-scroll {
      flex-grow: 1;
      overflow-y: auto;
      padding: 24px;
    }

    /* Cards & Grids */
    .grid {
      display: grid;
      grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
      gap: 20px;
    }

    .card {
      background: var(--card-color);
      border: 1px solid var(--border-color);
      padding: 20px;
      border-radius: 4px;
      position: relative;
    }

    .card-header {
      display: flex;
      justify-content: space-between;
      align-items: center;
      margin-bottom: 16px;
    }

    .card-title {
      font-weight: 700;
      font-size: 16px;
    }

    /* Status Badges */
    .badge {
      padding: 4px 8px;
      border-radius: 4px;
      font-size: 11px;
      font-weight: 700;
      text-transform: uppercase;
    }

    .status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
    .status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
    .status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
    .status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
    .status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }

    /* Timeline */
    .timeline {
      display: flex;
      flex-direction: column;
      gap: 12px;
    }

    .event {
      padding: 12px;
      border-left: 2px solid var(--border-color);
      background: rgba(255, 255, 255, 0.02);
      font-family: ui-monospace, monospace;
      font-size: 13px;
    }

    .event.high { border-left-color: var(--error); }
    .event.medium { border-left-color: var(--unstable); }
    .event.low { border-left-color: var(--nominal); }

    .event-header {
      display: flex;
      justify-content: space-between;
      margin-bottom: 4px;
      color: var(--text-muted);
    }

    /* Forms & Inputs */
    .controls {
      display: flex;
      gap: 12px;
      margin-top: 20px;
    }

    input, button {
      background: var(--card-color);
      border: 1px solid var(--border-color);
      color: var(--text-color);
      padding: 8px 16px;
      font-size: 14px;
      border-radius: 4px;
    }

    button {
      cursor: pointer;
      font-weight: 600;
    }

    button:hover { background: var(--border-color); }

    .btn-primary { background: var(--accent-color); color: white; border: none; }
    .btn-primary:hover { background: #359b6d; }

    /* Utility */
    .hidden { display: none !important; }
    .mono { font-family: ui-monospace, monospace; }
    .label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
    .value { font-weight: 500; margin-bottom: 12px; }

    .risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
    .risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
    .risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }

  </style>
</head>
|
<body>
|
||||||
|
<aside class="sidebar">
|
||||||
|
<div class="sidebar-header">HOMELAB OPERATOR</div>
|
||||||
|
<ul class="nav-list">
|
||||||
|
<li class="nav-item active" onclick="showView('dashboard', this)">
|
||||||
|
<span>Dashboard</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('actions', this)">
|
||||||
|
<span>Action Queue</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('nodes', this)">
|
||||||
|
<span>Nodes</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('services', this)">
|
||||||
|
<span>Services</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('deployments', this)">
|
||||||
|
<span>Deployments</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('topology', this)">
|
||||||
|
<span>Topology</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('events', this)">
|
||||||
|
<span>Events</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('correlation', this)">
|
||||||
|
<span>Correlation</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('recommendations', this)">
|
||||||
|
<span>Recommendations</span>
|
||||||
|
</li>
|
||||||
|
<li class="nav-item" onclick="showView('settings', this)">
|
||||||
|
<span>Settings</span>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
<div class="sidebar-footer">
|
||||||
|
<div id="summary-status">System Status: Loading...</div>
|
||||||
|
</div>
|
||||||
|
</aside>
|
||||||
|
|
||||||
|
<main class="main-content">
|
||||||
|
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
|
||||||
|
RUNTIME STATE IS STALE
|
||||||
|
</div>
|
||||||
|
<header>
|
||||||
|
<div style="display:flex; align-items:center; gap:20px">
|
||||||
|
<div class="view-title" id="current-view-title">Dashboard</div>
|
||||||
|
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
|
||||||
|
<option value="observe">OBSERVE</option>
|
||||||
|
<option value="recommend">RECOMMEND</option>
|
||||||
|
<option value="approval" selected>APPROVAL</option>
|
||||||
|
<option value="autonomous">AUTONOMOUS</option>
|
||||||
|
<option value="maintenance">MAINTENANCE</option>
|
||||||
|
</select>
|
||||||
|
</div>
|
||||||
|
<div class="header-actions" style="display:flex; gap:8px; align-items:center">
|
||||||
|
<button onclick="refreshData()">Refresh</button>
|
||||||
|
<button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
|
||||||
|
</div>
|
||||||
|
</header>
|
||||||
|
|
||||||
|
<div class="content-scroll">
|
||||||
|
<!-- Dashboard View -->
|
||||||
|
<div id="view-dashboard" class="view">
|
||||||
|
<div class="grid">
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">System Overview</div>
|
||||||
|
<div id="dashboard-summary" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">Pending Actions</div>
|
||||||
|
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">Active Incidents</div>
|
||||||
|
<div id="dashboard-incidents" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Actions View -->
|
||||||
|
<div id="view-actions" class="view hidden">
|
||||||
|
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
|
||||||
|
<div>
|
||||||
|
<h3>Pending Approval</h3>
|
||||||
|
<div id="actions-pending" class="timeline"></div>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<h3>Active / History</h3>
|
||||||
|
<div id="actions-history" class="timeline"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Nodes View -->
|
||||||
|
<div id="view-nodes" class="view hidden">
|
||||||
|
<div class="grid" id="nodes-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Services View -->
|
||||||
|
<div id="view-services" class="view hidden">
|
||||||
|
<div class="grid" id="services-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Deployments View -->
|
||||||
|
<div id="view-deployments" class="view hidden">
|
||||||
|
<div class="grid" id="deployments-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Topology View -->
|
||||||
|
<div id="view-topology" class="view hidden">
|
||||||
|
<div class="card" style="min-height:500px">
|
||||||
|
<div class="card-title">Runtime Topology</div>
|
||||||
|
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Events View -->
|
||||||
|
<div id="view-events" class="view hidden">
|
||||||
|
<div class="timeline" id="events-timeline"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Correlation View -->
|
||||||
|
<div id="view-correlation" class="view hidden">
|
||||||
|
<div id="correlation-chains" class="grid"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Recommendations View -->
|
||||||
|
<div id="view-recommendations" class="view hidden">
|
||||||
|
<div class="grid" id="recommendations-list"></div>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<!-- Settings View -->
|
||||||
|
<div id="view-settings" class="view hidden">
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-title">Configuration</div>
|
||||||
|
<div id="settings-content" style="margin-top:20px"></div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
</main>
|
||||||
|
|
||||||
|
<script>
|
||||||
|
let currentView = 'dashboard';
|
||||||
|
const pollInterval = 5000;
|
||||||
|
|
||||||
|
function showView(viewId, el) {
|
||||||
|
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
|
||||||
|
document.getElementById('view-' + viewId).classList.remove('hidden');
|
||||||
|
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
|
||||||
|
if (el) el.classList.add('active');
|
||||||
|
currentView = viewId;
|
||||||
|
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
|
||||||
|
refreshData();
|
||||||
|
}
|
||||||
|
|
||||||
|
async function fetchData(endpoint) {
|
||||||
|
try {
|
||||||
|
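const res = await fetch(endpoint, {cache: 'no-store'});
if (!res.ok) throw new Error(`HTTP ${res.status}`); // treat HTTP error statuses as failures instead of parsing an error body as data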
|
||||||
|
return await res.json();
|
||||||
|
} catch (e) {
|
||||||
|
console.error('Fetch error:', endpoint, e);
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function postData(endpoint, data) {
|
||||||
|
try {
|
||||||
|
const res = await fetch(endpoint, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: {'Content-Type': 'application/json'},
|
||||||
|
body: JSON.stringify(data)
|
||||||
|
});
|
||||||
|
return await res.json();
|
||||||
|
} catch (e) {
|
||||||
|
console.error('Post error:', endpoint, e);
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function mutateAction(id, status) {
|
||||||
|
const res = await postData('/action/mutate', {id, status});
|
||||||
|
if (res && res.status === 'ok') {
|
||||||
|
refreshData();
|
||||||
|
} else {
|
||||||
|
alert('Mutation failed');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function setOperatorMode(mode) {
|
||||||
|
console.log('Setting operator mode to:', mode);
|
||||||
|
const res = await postData('/mode', {mode});
|
||||||
|
if (res && res.status === 'ok') {
|
||||||
|
console.log('Mode updated successfully');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function formatTime(ts) {
|
||||||
|
if (!ts) return 'N/A';
|
||||||
|
return new Date(ts * 1000).toLocaleString();
|
||||||
|
}
|
||||||
|
|
||||||
|
function getStatusClass(status) {
|
||||||
|
status = (status || '').toLowerCase();
|
||||||
|
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
|
||||||
|
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
|
||||||
|
if (['unstable'].includes(status)) return 'status-unstable';
|
||||||
|
if (['reconciling'].includes(status)) return 'status-reconciling';
|
||||||
|
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
|
||||||
|
return '';
|
||||||
|
}
|
||||||
|
|
||||||
|
async function refreshData() {
|
||||||
|
// Refresh summary always
|
||||||
|
const summary = await fetchData('/summary');
|
||||||
|
if (summary) {
|
||||||
|
const statusEl = document.getElementById('summary-status');
|
||||||
|
statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
|
||||||
|
statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
|
||||||
|
|
||||||
|
// Handle stale state
|
||||||
|
const staleBanner = document.getElementById('stale-banner');
|
||||||
|
if (summary.stale) {
|
||||||
|
staleBanner.classList.remove('hidden');
|
||||||
|
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
|
||||||
|
} else {
|
||||||
|
staleBanner.classList.add('hidden');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'dashboard') {
|
||||||
|
const dashSummary = document.getElementById('dashboard-summary');
|
||||||
|
dashSummary.innerHTML = `
|
||||||
|
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
|
||||||
|
<div class="label">Services</div><div class="value">${summary.service_count}</div>
|
||||||
|
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
|
||||||
|
`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'dashboard' || currentView === 'actions') {
|
||||||
|
const actions = await fetchData('/actions');
|
||||||
|
if (actions) {
|
||||||
|
if (currentView === 'dashboard') {
|
||||||
|
const dashActions = document.getElementById('dashboard-actions-summary');
|
||||||
|
const pendingCount = actions.pending.length;
|
||||||
|
dashActions.innerHTML = `
|
||||||
|
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
|
||||||
|
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
|
||||||
|
`;
|
||||||
|
}
|
||||||
|
if (currentView === 'actions') {
|
||||||
|
const pendingEl = document.getElementById('actions-pending');
|
||||||
|
const historyEl = document.getElementById('actions-history');
|
||||||
|
|
||||||
|
pendingEl.innerHTML = actions.pending.map(a => `
|
||||||
|
<div class="card" style="margin-bottom:12px">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
|
||||||
|
<span class="badge risk-${a.risk_level}">${a.risk_level}</span>
|
||||||
|
</div>
|
||||||
|
<p>${a.description || a.action_type || 'No description'}</p>
|
||||||
|
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
|
||||||
|
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
|
||||||
|
<div class="controls">
|
||||||
|
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
|
||||||
|
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
`).join('') || 'No pending actions.';
|
||||||
|
|
||||||
|
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
|
||||||
|
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
|
||||||
|
<div class="event">
|
||||||
|
<div class="event-header">
|
||||||
|
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
|
||||||
|
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
|
||||||
|
</div>
|
||||||
|
<div>${a.description || a.action_type || 'No description'}</div>
|
||||||
|
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node) || 'unknown'}</small>
|
||||||
|
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
|
||||||
|
${a.transition_history ? `
|
||||||
|
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
|
||||||
|
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
|
||||||
|
</div>
|
||||||
|
` : ''}
|
||||||
|
</div>
|
||||||
|
`).join('') || 'No history.';
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Incidents are only rendered on the dashboard; the events view loads /events further down.
if (currentView === 'dashboard') {
|
||||||
|
const incidents = await fetchData('/incidents');
|
||||||
|
if (currentView === 'dashboard') {
|
||||||
|
const dashIncidents = document.getElementById('dashboard-incidents');
|
||||||
|
if (!incidents || incidents.length === 0) {
|
||||||
|
dashIncidents.textContent = 'No active incidents.';
|
||||||
|
} else {
|
||||||
|
dashIncidents.innerHTML = incidents.map(inc => `
|
||||||
|
<div class="event ${inc.severity}">
|
||||||
|
<strong>${(inc.severity || 'unknown').toUpperCase()}:</strong> ${inc.message}<br>
|
||||||
|
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'nodes') {
|
||||||
|
const nodes = (await fetchData('/nodes')) || []; // fetchData returns null on failure
|
||||||
|
const list = document.getElementById('nodes-list');
|
||||||
|
list.innerHTML = nodes.map(node => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${node.hostname}</div>
|
||||||
|
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">ID</div><div class="value mono">${node.id}</div>
|
||||||
|
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
|
||||||
|
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
|
||||||
|
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
|
||||||
|
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
|
||||||
|
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'services') {
|
||||||
|
const services = (await fetchData('/services')) || []; // fetchData returns null on failure
|
||||||
|
const list = document.getElementById('services-list');
|
||||||
|
list.innerHTML = services.map(svc => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${svc.name}</div>
|
||||||
|
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
|
||||||
|
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
|
||||||
|
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
|
||||||
|
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'deployments') {
|
||||||
|
const deps = (await fetchData('/deployments')) || []; // fetchData returns null on failure
|
||||||
|
const list = document.getElementById('deployments-list');
|
||||||
|
list.innerHTML = deps.map(dep => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${dep.service}</div>
|
||||||
|
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">ID</div><div class="value mono">${dep.id}</div>
|
||||||
|
<div class="label">Stage</div><div class="value">${dep.stage}</div>
|
||||||
|
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
|
||||||
|
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
|
||||||
|
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'events') {
|
||||||
|
const events = (await fetchData('/events')) || []; // fetchData returns null on failure
|
||||||
|
const timeline = document.getElementById('events-timeline');
|
||||||
|
timeline.innerHTML = events.map(ev => `
|
||||||
|
<div class="event ${ev.severity}">
|
||||||
|
<div class="event-header">
|
||||||
|
<span>${(ev.type || 'unknown').toUpperCase()}</span>
|
||||||
|
<span>${formatTime(ev.timestamp)}</span>
|
||||||
|
</div>
|
||||||
|
<div>${ev.message}</div>
|
||||||
|
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'recommendations') {
|
||||||
|
const recs = (await fetchData('/recommendations')) || []; // fetchData returns null on failure
|
||||||
|
const list = document.getElementById('recommendations-list');
|
||||||
|
list.innerHTML = recs.map(rec => `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${rec.title}</div>
|
||||||
|
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
|
||||||
|
</div>
|
||||||
|
<p>${rec.description}</p>
|
||||||
|
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
|
||||||
|
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
|
||||||
|
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
|
||||||
|
<div class="controls">
|
||||||
|
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
`).join('');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'topology') {
|
||||||
|
const nodes = await fetchData('/nodes');
|
||||||
|
const services = await fetchData('/services');
|
||||||
|
const topMap = document.getElementById('topology-map');
|
||||||
|
if (nodes && services) {
|
||||||
|
topMap.innerHTML = nodes.map(node => {
|
||||||
|
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
|
||||||
|
return `
|
||||||
|
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">${node.hostname}</div>
|
||||||
|
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
|
||||||
|
</div>
|
||||||
|
<div class="label">Capabilities</div>
|
||||||
|
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
|
||||||
|
<div class="label">Services</div>
|
||||||
|
<div style="font-size:12px; margin-bottom:10px">
|
||||||
|
${nodeServices.length > 0 ? nodeServices.map(s => `
|
||||||
|
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
|
||||||
|
<span>${s.name}</span>
|
||||||
|
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
|
||||||
|
</div>
|
||||||
|
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
|
||||||
|
`).join('') : '<div class="value">None</div>'}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
`;
|
||||||
|
}).join('');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (currentView === 'correlation') {
|
||||||
|
const incidents = await fetchData('/incidents');
|
||||||
|
const actions = await fetchData('/actions');
|
||||||
|
const list = document.getElementById('correlation-chains');
|
||||||
|
if (incidents && actions) {
|
||||||
|
const allActions = Object.values(actions).flat();
|
||||||
|
list.innerHTML = incidents.map(inc => {
|
||||||
|
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
|
||||||
|
return `
|
||||||
|
<div class="card">
|
||||||
|
<div class="card-header">
|
||||||
|
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
|
||||||
|
<span class="badge status-error">Active</span>
|
||||||
|
</div>
|
||||||
|
<p>${inc.message}</p>
|
||||||
|
<div class="label">Related Actions</div>
|
||||||
|
${related.map(a => `
|
||||||
|
<div class="event" style="margin-top:5px">
|
||||||
|
<strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
<small>${a.description || a.action_type || 'No description'}</small>
|
||||||
|
</div>
|
||||||
|
`).join('') || '<div class="value">No actions yet</div>'}
|
||||||
|
</div>
|
||||||
|
`;
|
||||||
|
}).join('');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (currentView === 'settings') {
|
||||||
|
const config = (await fetchData('/config')) || {}; // fetchData returns null on failure
|
||||||
|
const content = document.getElementById('settings-content');
|
||||||
|
content.innerHTML = `
|
||||||
|
<div class="label">Auto Mode</div>
|
||||||
|
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
|
||||||
|
<div class="label">Action Thresholds</div>
|
||||||
|
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
|
||||||
|
<div class="label">Telegram Integration</div>
|
||||||
|
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
|
||||||
|
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
|
||||||
|
`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function copyForAI() {
|
||||||
|
const btn = document.getElementById('copy-ai-btn');
|
||||||
|
const original = btn.textContent;
|
||||||
|
btn.textContent = 'Copying...';
|
||||||
|
btn.disabled = true;
|
||||||
|
|
||||||
|
try {
|
||||||
|
const snap = await fetchData('/snapshot');
|
||||||
|
if (!snap) throw new Error('snapshot fetch failed');
|
||||||
|
|
||||||
|
const now = new Date(snap.timestamp);
|
||||||
|
const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
|
||||||
|
const lines = [];
|
||||||
|
|
||||||
|
lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
|
||||||
|
|
||||||
|
if (snap.nodes && snap.nodes.length > 0) {
|
||||||
|
lines.push('NODES: ' + snap.nodes.map(n =>
|
||||||
|
`${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
|
||||||
|
).join(', '));
|
||||||
|
} else {
|
||||||
|
lines.push('NODES: none');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
|
||||||
|
lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
|
||||||
|
`${s.name} (${s.node}) - ${s.health}`
|
||||||
|
).join(', '));
|
||||||
|
} else {
|
||||||
|
lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
|
||||||
|
}
|
||||||
|
|
||||||
|
const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
|
||||||
|
if (activeIncidents.length > 0) {
|
||||||
|
lines.push('INCIDENTS: ' + activeIncidents.map(i =>
|
||||||
|
`[${i.severity}] ${i.message} (${i.node})`
|
||||||
|
).join('; '));
|
||||||
|
} else {
|
||||||
|
lines.push('INCIDENTS: none');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (snap.events && snap.events.length > 0) {
|
||||||
|
lines.push(`EVENTS (last ${snap.events.length}):`);
|
||||||
|
snap.events.forEach(ev => {
|
||||||
|
const ts = ev.timestamp
|
||||||
|
? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
|
||||||
|
: '?';
|
||||||
|
const svc = ev.service ? '/' + ev.service : '';
|
||||||
|
lines.push(` ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
|
||||||
|
});
|
||||||
|
} else {
|
||||||
|
lines.push('EVENTS (last 10): none');
|
||||||
|
}
|
||||||
|
|
||||||
|
const s = snap.summary || {};
|
||||||
|
lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
|
||||||
|
|
||||||
|
await navigator.clipboard.writeText(lines.join('\n'));
|
||||||
|
btn.textContent = 'Copied!';
|
||||||
|
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
|
||||||
|
} catch (e) {
|
||||||
|
console.error('copyForAI error:', e);
|
||||||
|
btn.textContent = 'Error';
|
||||||
|
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initial load
|
||||||
|
refreshData();
|
||||||
|
// Poll for updates
|
||||||
|
setInterval(refreshData, pollInterval);
|
||||||
|
|
||||||
|
</script>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
301
services/agent-system/webui/web.py
Normal file
@@ -0,0 +1,301 @@
import json
import os
import time
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))

STATIC_DIR = Path(__file__).parent

DEFAULT_CONFIG = {
    "operator_mode": "approval",
    "auto_mode": True,
    "action_thresholds": {
        "restart_ha": 0.8,
        "check_network": 0.9,
    },
    "default_threshold": 0.9,
    "allowed_auto_actions": ["restart_ha"],
}


def read_json_file(path, default=None):
    """Return parsed JSON from path, or the default ([] if None) when the file is missing or unreadable."""
    fallback = default if default is not None else []
    if not path.exists():
        return fallback
    try:
        return json.loads(path.read_text())
    except (OSError, ValueError):
        # OSError: unreadable file; ValueError: invalid JSON or undecodable text.
        return fallback


def get_config():
    config_path = STATE_DIR / "operator-config.json"
    if config_path.exists():
        return read_json_file(config_path, DEFAULT_CONFIG)
    return DEFAULT_CONFIG


def save_config(config):
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))


def current_nodes():
    return read_json_file(WORLD_DIR / "nodes.json")


def current_services():
    return read_json_file(WORLD_DIR / "services.json")


def current_deployments():
    return read_json_file(WORLD_DIR / "deployments.json")


def current_incidents():
    return read_json_file(WORLD_DIR / "incidents.json")


def current_recommendations():
    return read_json_file(WORLD_DIR / "recommendations.json")


def current_summary():
    path = WORLD_DIR / "runtime-summary.json"
    summary = read_json_file(path, default={})
    if summary:
        last_update_val = summary.get("last_update")
        if last_update_val:
            try:
                if isinstance(last_update_val, str):
                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
                else:
                    last_update = float(last_update_val)
            except Exception:
                last_update = os.path.getmtime(path)
        else:
            last_update = os.path.getmtime(path)
        summary["last_update"] = last_update
        summary["stale"] = (time.time() - last_update) > 60
    return summary


def current_events():
    return read_json_file(WORLD_DIR / "events.json", default=[])


def current_actions():
    actions = {}
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    for status in statuses:
        actions[status] = []
        status_dir = ACTIONS_DIR / status
        if status_dir.exists():
            for f in status_dir.glob("*.json"):
                data = read_json_file(f)
                # Skip empty files and files whose top-level JSON value is not an object.
                if isinstance(data, dict) and data:
                    # Inject metadata for the UI. mutate_action() looks the file up
                    # by this id, so it is assumed to match the file stem.
                    data["id"] = data.get("action_id") or f.stem
                    data["status"] = status
                    actions[status].append(data)
    return actions


def mutate_action(action_id, target_status):
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"

    # Reject ids that could escape ACTIONS_DIR when interpolated into a file name.
    if not isinstance(action_id, str) or not action_id or os.path.basename(action_id) != action_id:
        return False, f"Invalid action id: {action_id}"

    # Find where the action is
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break

    if not source_path:
        return False, f"Action {action_id} not found"

    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"

    try:
        now = time.time()
        data = json.loads(source_path.read_text())
        data["status"] = target_status
        data["updated_at"] = now

        # Keep history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": now
        })
        data["transition_history"] = history

        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)


def get_snapshot():
    nodes = current_nodes()
    services = current_services()
    incidents = current_incidents()
    events = current_events()
    summary = current_summary()

    non_nominal = [s for s in services if s.get("health") != "nominal"]
    nominal_count = len(services) - len(non_nominal)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "nodes": nodes,
        "non_nominal_services": non_nominal,
        "nominal_service_count": nominal_count,
        "total_service_count": len(services),
        "incidents": incidents,
        "events": events[:10],
    }


def send_json(status, payload, handler):
    body = (json.dumps(payload) + "\n").encode("utf-8")
    handler.send_response(status)
    handler.send_header("Content-Type", "application/json")
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/config":
            send_json(200, get_config(), self)
            return

        if self.path == "/nodes":
            send_json(200, current_nodes(), self)
            return

        if self.path == "/services":
            send_json(200, current_services(), self)
            return

        if self.path == "/deployments":
            send_json(200, current_deployments(), self)
            return

        if self.path == "/incidents":
            send_json(200, current_incidents(), self)
            return

        if self.path == "/recommendations":
            send_json(200, current_recommendations(), self)
            return

        if self.path == "/summary":
            send_json(200, current_summary(), self)
            return

        if self.path == "/events":
            send_json(200, current_events(), self)
            return

        if self.path == "/actions":
            send_json(200, current_actions(), self)
            return

        if self.path == "/snapshot":
            send_json(200, get_snapshot(), self)
            return

        if self.path in ("/", "/index.html"):
            body = (STATIC_DIR / "index.html").read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return

        self.send_error(404)

    def do_POST(self):
        if self.path not in (
            "/config",
            "/action/mutate",
            "/mode",
        ):
            self.send_error(404)
            return

        length = int(self.headers.get("Content-Length", "0"))
        raw_body = self.rfile.read(length).decode("utf-8")
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError:
            self.send_error(400, "Invalid JSON")
            return

        # The handlers below call payload.get() and dict.update(), so require a JSON object.
        if not isinstance(payload, dict):
            self.send_error(400, "JSON object expected")
            return

        if self.path == "/config":
            config = get_config()
            config.update(payload)
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/mode":
            mode = payload.get("mode")
            if not mode:
                self.send_error(400, "mode is required")
                return
            config = get_config()
            config["operator_mode"] = mode
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return

        if self.path == "/action/mutate":
            action_id = payload.get("id")
            target = payload.get("status")
            if not action_id or not target:
                self.send_error(400, "id and status are required")
                return
            success, msg = mutate_action(action_id, target)
            if success:
                send_json(200, {"status": "ok"}, self)
            else:
                self.send_error(500, msg)
            return

    def log_message(self, format, *args):
        return


if __name__ == "__main__":
    # Ensure directories exist
    for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
        d.mkdir(parents=True, exist_ok=True)
    for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
        (ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)

    port = int(os.getenv("PORT", "8080"))
    print(f"Operator Control Plane starting on 0.0.0.0:{port}")
    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    server.serve_forever()
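The `mutate_action` function above changes an action's status by writing the JSON file into the target status directory and then deleting the original. The following is a minimal sketch of the same transition, assuming one wants the write itself to be atomic: it writes to a temporary file in the target directory and renames it into place with `os.replace`. The helper name `move_action` and its `actions_dir` parameter are hypothetical and do not appear in the diff.

```python
import json
import os
import tempfile
import time
from pathlib import Path

STATUSES = ["pending", "approved", "running", "completed", "failed", "rejected"]


def move_action(actions_dir: Path, action_id: str, target_status: str) -> bool:
    """Hypothetical helper: move <action_id>.json into the target status directory."""
    if target_status not in STATUSES:
        return False
    # Locate the action in whichever status directory currently holds it.
    source = next(
        (actions_dir / s / f"{action_id}.json" for s in STATUSES
         if (actions_dir / s / f"{action_id}.json").exists()),
        None,
    )
    if source is None:
        return False

    data = json.loads(source.read_text())
    data.setdefault("transition_history", []).append(
        {"from": source.parent.name, "to": target_status, "timestamp": time.time()}
    )
    data["status"] = target_status

    target_dir = actions_dir / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name

    # Write to a temp file in the same directory, then rename over the target.
    # os.replace is atomic when source and destination share a filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(data, fh, indent=2)
    os.replace(tmp_name, target)
    if source != target:
        source.unlink()
    return True
```

With this shape a reader of the target directory never sees a half-written JSON file. A crash between the rename and the unlink can still leave the action in two directories, so the sketch narrows the window rather than closing it.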
10
services/brain-watchdog/Dockerfile
Normal file
@@ -0,0 +1,10 @@
FROM python:3.11-slim

WORKDIR /app

COPY src/ src/

ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app/src

CMD ["python", "-m", "brain_watchdog.main"]
30
services/brain-watchdog/docker-compose.yml
Normal file
@@ -0,0 +1,30 @@
services:
  brain-watchdog:
    build: .
    container_name: brain-watchdog
    restart: unless-stopped

    env_file:
      - /opt/homelab/config/brain-watchdog/.env

    volumes:
      - brain_watchdog_data:/data

    healthcheck:
      test:
        - "CMD"
        - "python"
        - "-c"
        - |
          import os, time, json, sys
          p = '/data/state.json'
          if not os.path.exists(p): sys.exit(1)
          age = time.time() - os.path.getmtime(p)
          sys.exit(0 if age < 300 else 1)
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  brain_watchdog_data:
7
services/brain-watchdog/env.example
Normal file
@@ -0,0 +1,7 @@
CONTROL_PLANE_URL=
STALE_THRESHOLD=600
INTERVAL=60
FAILS_BEFORE_ALERT=3
TG_TOKEN=
TG_CHAT_ID=
HEALTHCHECKS_URL=
10
services/brain-watchdog/healthcheck.sh
Executable file
@@ -0,0 +1,10 @@
#!/bin/sh
# Healthy if state.json was written within the last 5 minutes.
python -c "
import os, time, sys
p = '/data/state.json'
if not os.path.exists(p):
    sys.exit(1)
age = time.time() - os.path.getmtime(p)
sys.exit(0 if age < 300 else 1)
"
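Both the Compose healthcheck and `healthcheck.sh` apply the same rule: the container is healthy if `/data/state.json` exists and was modified less than 300 seconds ago. As a rough sketch, the rule can be written as a small function with an injectable clock so it can be tested without waiting. The name `state_is_fresh` and its parameters are assumptions made for illustration, not part of the diff.

```python
import os
import time
from typing import Optional


def state_is_fresh(path: str, max_age: float = 300.0, now: Optional[float] = None) -> bool:
    """Return True if `path` exists and was modified less than `max_age` seconds ago."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # a missing or unreadable file counts as unhealthy
    current = time.time() if now is None else now
    return (current - mtime) < max_age
```

A healthcheck entry point would then presumably call `sys.exit(0 if state_is_fresh("/data/state.json") else 1)`, which matches the exit codes used by the inline script.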
3
services/brain-watchdog/pytest.ini
Normal file
@@ -0,0 +1,3 @@
[pytest]
pythonpath = src
testpaths = tests
34
services/brain-watchdog/service.yaml
Normal file
@@ -0,0 +1,34 @@
service:
  name: brain-watchdog
  owner_node: piha
  exposure: private
  description: >
    External watchdog for the control-plane on VPS. Queries /summary over
    Tailscale and alerts via Telegram Bot API directly — no dependency on the
    control-plane itself. Freshness is computed locally from last_update epoch.

dependencies:
  - control-plane  # external — on VPS; deliberately untrusted for liveness

healthcheck:
  type: docker
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 30s

restart_policy: unless-stopped

persistence:
  paths:
    - /data  # state.json: fail_count, alerted, last_ok

runtime:
  env_vars:
    - CONTROL_PLANE_URL   # Tailscale IP + port of operator-ui (required)
    - STALE_THRESHOLD     # seconds before brain is considered stale (default: 600)
    - INTERVAL            # poll interval seconds (default: 60)
    - FAILS_BEFORE_ALERT  # consecutive failures before Telegram alert (default: 3)
    - TG_TOKEN            # Telegram Bot API token (required)
    - TG_CHAT_ID          # Telegram chat/user ID (required)
    - HEALTHCHECKS_URL    # optional healthchecks.io ping URL
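The `env_vars` list above marks three variables as required and gives defaults for the rest. One possible way to express that contract in code is sketched below: a loader that takes any environment-like mapping, fails early with a clear message when a required key is missing, and applies the documented defaults. `WatchdogConfig` and `load_config` are hypothetical names; the watchdog in this diff reads `os.environ` at module level instead.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WatchdogConfig:
    control_plane_url: str
    tg_token: str
    tg_chat_id: str
    stale_threshold: int = 600
    interval: int = 60
    fails_before_alert: int = 3
    healthchecks_url: str = ""


def load_config(env: Mapping[str, str]) -> WatchdogConfig:
    """Build a config from an environment-like mapping; raise if a required key is missing."""
    missing = [k for k in ("CONTROL_PLANE_URL", "TG_TOKEN", "TG_CHAT_ID") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required env vars: {', '.join(missing)}")
    return WatchdogConfig(
        control_plane_url=env["CONTROL_PLANE_URL"].rstrip("/"),
        tg_token=env["TG_TOKEN"],
        tg_chat_id=env["TG_CHAT_ID"],
        stale_threshold=int(env.get("STALE_THRESHOLD", "600")),
        interval=int(env.get("INTERVAL", "60")),
        fails_before_alert=int(env.get("FAILS_BEFORE_ALERT", "3")),
        healthchecks_url=env.get("HEALTHCHECKS_URL", "").strip(),
    )
```

Passing the mapping in, rather than reading `os.environ` at import time, would also let tests build a configuration without setting process-wide variables first.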
157
services/brain-watchdog/src/brain_watchdog/main.py
Normal file
@@ -0,0 +1,157 @@
"""
brain-watchdog: external watchdog for the control-plane on VPS.

Runs on PIHA; queries /summary directly over Tailscale and alerts via
Telegram Bot API without going through the control-plane itself.
Never trusts the self-reported "status" field — freshness is computed
locally from last_update epoch vs. time.time().
"""

import json
import os
import time
import urllib.error
import urllib.request
from pathlib import Path

CONTROL_PLANE_URL = os.environ["CONTROL_PLANE_URL"].rstrip("/")
STALE_THRESHOLD = int(os.environ.get("STALE_THRESHOLD", "600"))
INTERVAL = int(os.environ.get("INTERVAL", "60"))
FAILS_BEFORE_ALERT = int(os.environ.get("FAILS_BEFORE_ALERT", "3"))
TG_TOKEN = os.environ["TG_TOKEN"]
TG_CHAT_ID = os.environ["TG_CHAT_ID"]
HEALTHCHECKS_URL = os.environ.get("HEALTHCHECKS_URL", "").strip()

STATE_FILE = Path("/data/state.json")


def load_state() -> dict:
    # Start from defaults so a partial or corrupt state file cannot cause a
    # KeyError later (main() reads state["alerted"] directly).
    state = {"fail_count": 0, "alerted": False, "last_ok": 0.0}
    if STATE_FILE.exists():
        try:
            loaded = json.loads(STATE_FILE.read_text())
            if isinstance(loaded, dict):
                state.update(loaded)
        except Exception:
            pass
    return state


def save_state(state: dict) -> None:
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))


def http_get(url: str, timeout: int = 10) -> tuple[int | None, dict | None]:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        return exc.code, None
    except Exception:
        return None, None

    # Parse separately so an invalid body is reported as (status, None)
    # rather than being mistaken for a connection error.
    try:
        return status, json.loads(raw)
    except ValueError:
        return status, None


def send_telegram(message: str) -> bool:
    url = f"https://api.telegram.org/bot{TG_TOKEN}/sendMessage"
    payload = json.dumps(
        {"chat_id": TG_CHAT_ID, "text": message, "parse_mode": "HTML"}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except Exception as exc:
        print(f"[telegram] send failed: {exc}", flush=True)
        return False


def ping_healthchecks() -> None:
    if not HEALTHCHECKS_URL:
        return
    try:
        urllib.request.urlopen(HEALTHCHECKS_URL, timeout=10)
    except Exception as exc:
        print(f"[healthchecks] ping failed: {exc}", flush=True)


def check() -> tuple[bool, str]:
    """Return (ok, human-readable reason). Never reads 'status' field."""
    status, body = http_get(f"{CONTROL_PLANE_URL}/summary")

    if status is None:
        return False, "panel unreachable (connection error)"

    if status != 200:
        return False, f"panel returned HTTP {status}"

    if not body:
        return False, "panel returned empty / invalid JSON"

    raw = body.get("last_update")
    if raw is None:
        return False, "summary missing last_update field"

    try:
        last_update_ts = float(raw)
    except (TypeError, ValueError):
        return False, f"last_update not parseable: {raw!r}"

    age = time.time() - last_update_ts
    if age > STALE_THRESHOLD:
        return False, (
            f"brain stale: last update {int(age // 60)}m ago "
            f"(threshold {STALE_THRESHOLD // 60}m)"
        )

    return True, f"ok (age {int(age)}s)"


def main() -> None:
    print(
        f"[brain-watchdog] starting — "
        f"url={CONTROL_PLANE_URL} "
        f"stale_threshold={STALE_THRESHOLD}s "
        f"interval={INTERVAL}s "
        f"fails_before_alert={FAILS_BEFORE_ALERT}",
        flush=True,
    )
    state = load_state()

    while True:
        ok, reason = check()
        ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"[{ts}] {'OK ' if ok else 'FAIL'} — {reason}", flush=True)

        if ok:
            if state["alerted"]:
                send_telegram(
                    "✅ <b>brain-watchdog: control-plane RECOVERED</b>\n"
                    f"{reason}"
                )
                print("[telegram] sent recovery alert", flush=True)
            state["fail_count"] = 0
            state["alerted"] = False
            state["last_ok"] = time.time()
            save_state(state)
            ping_healthchecks()
        else:
            state["fail_count"] = state.get("fail_count", 0) + 1
            save_state(state)

            if state["fail_count"] >= FAILS_BEFORE_ALERT and not state["alerted"]:
                sent = send_telegram(
                    "🚨 <b>brain-watchdog: control-plane DOWN</b>\n"
                    f"Reason: {reason}\n"
                    f"Consecutive failures: {state['fail_count']}\n"
                    f"URL: <code>{CONTROL_PLANE_URL}</code>"
                )
                if sent:
                    state["alerted"] = True
                    save_state(state)
                    print("[telegram] sent alert", flush=True)

        time.sleep(INTERVAL)


if __name__ == "__main__":
    main()
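The loop in `main()` mixes I/O with a small state machine: failures are counted, one alert fires once the count reaches `FAILS_BEFORE_ALERT`, and a recovery message fires on the first success after an alert. The sketch below isolates that state machine as a pure function so the transitions can be checked without a network. The name `next_state` and the action labels are hypothetical, and unlike the original loop the sketch assumes the Telegram send always succeeds.

```python
from typing import Tuple


def next_state(state: dict, ok: bool, fails_before_alert: int = 3) -> Tuple[dict, str]:
    """Return (new_state, action), where action is "none", "alert" or "recover"."""
    new = dict(state)
    if ok:
        # First success after an alert triggers a recovery notification.
        action = "recover" if state.get("alerted") else "none"
        new["fail_count"] = 0
        new["alerted"] = False
        return new, action

    new["fail_count"] = state.get("fail_count", 0) + 1
    if new["fail_count"] >= fails_before_alert and not state.get("alerted"):
        # The loop in the diff only sets alerted after a successful send;
        # this sketch assumes the send always succeeds.
        new["alerted"] = True
        return new, "alert"
    return new, "none"
```

Feeding it four failures followed by two successes yields one `"alert"` on the third failure and one `"recover"` on the first success, which is the de-duplication behaviour the `alerted` flag exists to provide.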
0
services/brain-watchdog/tests/__init__.py
Normal file
66
services/brain-watchdog/tests/test_main.py
Normal file
@@ -0,0 +1,66 @@
"""
Tests for brain_watchdog.main.

Module-level env vars are required at import time; set them before the first
import of the module so tests can run without a real control-plane.
"""
import importlib.util
import os
import time
from unittest.mock import patch

os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
os.environ.setdefault("TG_TOKEN", "test_token")
os.environ.setdefault("TG_CHAT_ID", "12345")

import brain_watchdog.main as bwm


def test_package_importable():
    spec = importlib.util.find_spec("brain_watchdog")
    assert spec is not None


def test_check_ok_fresh():
    now = time.time()
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
        ok, reason = bwm.check()
    assert ok
    assert "ok" in reason


def test_check_fail_stale():
    now = time.time()
    stale_ts = now - (bwm.STALE_THRESHOLD + 120)
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
        ok, reason = bwm.check()
    assert not ok
    assert "stale" in reason


def test_check_fail_unreachable():
    with patch.object(bwm, "http_get", return_value=(None, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "unreachable" in reason


def test_check_fail_http_error():
    with patch.object(bwm, "http_get", return_value=(503, None)):
        ok, reason = bwm.check()
    assert not ok
    assert "503" in reason


def test_check_fail_missing_last_update():
    with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
        ok, reason = bwm.check()
    assert not ok
    assert "last_update" in reason


def test_check_fail_unparseable_timestamp():
    with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
        ok, reason = bwm.check()
    assert not ok
    assert "parseable" in reason
24
services/control-plane/Dockerfile
Normal file
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir pyyaml

# Create homelab user
RUN useradd -m -u 1000 homelab

# Copy sources
COPY src/ /app/src/
# Also need the observer script if we want to run it from here,
# but I'll copy it from the repo during build or mount it.
# Actually, I'll copy the entire scripts/ directory to /repo/scripts
# so the supervisor/executor can find them.

# For simplicity, we'll assume the repo is mounted at /repo
ENV REPO_ROOT=/repo
ENV RUNTIME_PATH=/opt/homelab
ENV PYTHONUNBUFFERED=1

# Default command (will be overridden in docker-compose)
USER homelab
CMD ["python", "src/operator_ui.py"]
73
services/control-plane/deploy-local.sh
Executable file
@@ -0,0 +1,73 @@
#!/bin/bash
# services/control-plane/deploy-local.sh
set -e

# 1. Validate it is deploying control-plane
if [[ ! $(pwd) == *"/services/control-plane" ]]; then
    echo "Error: Script must be run from services/control-plane directory"
    exit 1
fi

if [[ ! -f "docker-compose.yml" ]]; then
    echo "Error: docker-compose.yml not found"
    exit 1
fi

echo "--- Preparing Control Plane Directories ---"
# 2. Prepare required dirs
# /opt/homelab/config
# /opt/homelab/actions/{pending,approved,rejected,running,completed,failed}
# /opt/homelab/world
# /opt/homelab/state

DIRS=(
    "/opt/homelab/config"
    "/opt/homelab/actions/pending"
    "/opt/homelab/actions/approved"
    "/opt/homelab/actions/rejected"
    "/opt/homelab/actions/running"
    "/opt/homelab/actions/completed"
    "/opt/homelab/actions/failed"
    "/opt/homelab/world"
    "/opt/homelab/state"
)

for dir in "${DIRS[@]}"; do
    if [ ! -d "$dir" ]; then
        echo "Creating $dir"
        sudo mkdir -p "$dir"
    fi
done

# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
echo "Checking /opt/homelab ownership..."
_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
if [[ -n "$_chown_needed" ]]; then
    echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
    sudo chown -R 1000:1000 /opt/homelab
else
    echo "Ownership already correct, skipping chown"
fi

echo "Checking /opt/homelab directory permissions..."
_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
if [[ -n "$_chmod_needed" ]]; then
    echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
    sudo chmod -R 775 /opt/homelab 2>/dev/null || true
else
    echo "Permissions already correct, skipping chmod"
fi

# 4. Run docker compose up -d --build --force-recreate
echo "--- Starting Control Plane Services ---"
COMPOSE_ARGS="-f docker-compose.yml"
OVERRIDE_FILE="../../hosts/vps/runtime/control-plane/docker-compose.override.yml"
if [ -f "$OVERRIDE_FILE" ]; then
    echo "Using override: $OVERRIDE_FILE"
    COMPOSE_ARGS="$COMPOSE_ARGS -f $OVERRIDE_FILE"
fi
docker compose $COMPOSE_ARGS up -d --build --force-recreate

# 5. Print docker ps for control-plane containers
echo "--- Deployment Status ---"
docker ps --filter "name=control-plane"
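Step 3 of the script avoids an unconditional `sudo chown -R` by first asking `find` for a single path whose owner is not 1000:1000 and stopping at the first hit (`-print -quit`). For readers more at home in Python, the same early-exit check could be sketched roughly as follows. The function name `first_foreign_path` is hypothetical, and the sketch only reports the mismatch; it does not change ownership.

```python
import os
from typing import Optional


def first_foreign_path(root: str, uid: int, gid: int) -> Optional[str]:
    """Return the first path under `root` not owned by uid:gid, or None if all match."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # Each directory is checked when os.walk visits it as dirpath.
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            st = os.lstat(path)  # lstat: inspect symlinks themselves, do not follow them
            if st.st_uid != uid or st.st_gid != gid:
                return path  # stop at the first mismatch, like `-print -quit`
    return None
```

A caller would run the privileged `chown` only when this returns a path, which mirrors the `[[ -n "$_chown_needed" ]]` test in the script.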
Some files were not shown because too many files have changed in this diff