Compare commits


100 commits

Author SHA1 Message Date
Oskar Kapala 58ac6edd7d fix(stability-agent): run as uid 1000 with docker group access
stability-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.

- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) to retain docker.sock:ro access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:20:54 +02:00
Oskar Kapala 19fd8799d9 fix(node-agent): run as uid 1000 with docker group access
node-agent had no USER instruction and no user: in compose, running
as root and writing root-owned files to /opt/homelab bind-mount.

- Dockerfile: add useradd -m -u 1000 homelab + USER homelab
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) to retain docker.sock access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:20:31 +02:00
Oskar Kapala 7f17b65278 fix(control-plane): run executor as uid 1000 with docker group access
Executor was the only control-plane container running as root (uid=0),
writing root-owned files to /opt/homelab via bind-mount and triggering
false sudo on every deploy.

- Dockerfile: add USER homelab after useradd (useradd already present)
- docker-compose.yml: add user: "1000:1000" and group_add: ["999"]
  (GID 999 = docker group on VPS) so executor retains docker.sock access

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 18:19:58 +02:00
Oskar Kapala e6a2443412 fix(dev): agent.sh worktree_count/paths grep exit-1 on empty set
grep -cv (and grep -v) return exit code 1 when no lines are selected.
With set -euo pipefail this silently aborted the script before count
was returned — causing 'agent.sh new' to fail on a fresh repo with no
existing worktrees.

Fix: move the grep -v into worktree_paths with '|| true' so the
function always exits 0, then derive worktree_count via wc -l.
2026-06-03 18:04:38 +02:00
Oskar Kapala f9b145585f fix(dev): agent.sh validate_name set -e safety + ERR trap
Refactor [ test ] && prefail pattern to if/then/fi — set -euo pipefail
was silently exiting after the loop because the failing-test compound
propagated exit code 1 through the function return.

Add ERR trap so future silent fails get diagnosed at the source.
2026-06-03 18:02:50 +02:00
Oskar Kapala 3b620ef7e3 docs(claude): multi-agent worktree mode section
Main checkout = deploy-only. .agent-task marker triggers mandatory
loading of worktree-aware skill. Only the human runs scripts/dev/agent.sh.
2026-06-03 17:41:35 +02:00
Oskar Kapala 745e52723c feat(skills): worktree-aware skill for Claude Code
Encodes branch hygiene for CC running in task worktrees: commit only to
assigned branch, no push origin master, no touching main checkout, no
git add -A, no worktree management, mandatory final report.
2026-06-03 17:41:35 +02:00
Oskar Kapala 1abe925f65 feat(dev): scripts/dev/agent.sh — multi-agent worktree dispatcher
new/list/merge/clean. Decisions: branch task/<name>, sibling worktree
~/homelab-codex-ws-<name>, ff-only auto-merge, cap 4.
2026-06-03 17:41:35 +02:00
Oskar Kapala 1c69a5bc29 feat(skills): save-session skill for Claude Code
Records session facts (git log, diff --stat, deploys from transcript)
by appending to docs/sessions/YYYY-MM-DD.md with a mandatory narrative
placeholder. Never touches backlog.md or CLAUDE.md without explicit
instruction. Commits only the session file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:46 +02:00
Oskar Kapala 02e7c28823 feat(skills): deploy skill for Claude Code
Instructs CC to always route deploy/redeploy/ship/wdróż requests through
scripts/deploy/deploy.sh, maps exit codes to required actions, and
enforces no-bypass rules for gate and branch checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:40 +02:00
Oskar Kapala db592fbc28 feat(deploy): Saturn-side dispatcher wrapper
Replaces the per-node staged framework with a single entry point that
runs from SATURN: preflight (branch/clean-tree/push/SSH), gate (pytest +
docker build per service), execute (control-plane.sh --ssh or remote
deploy-node.sh), verify (docker ps), and one-line report.

Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=sudo-handoff.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 16:06:36 +02:00
Oskar Kapala 00fc36df3a fix(deploy): skip sudo chown/chmod when /opt/homelab ownership is already correct
deploy-local.sh previously ran `sudo chown -R 1000:1000` and
`sudo chmod -R 775` unconditionally on every deploy, which blocked
non-TTY execution (CC/CI) on VPS where /opt/homelab is already 1000:1000.

Both steps are now conditional using `find ... -print -quit`:
- chown: runs only if any file/dir is NOT uid/gid 1000
- chmod: runs only if any directory is missing -775 permission bits

When everything is correct (steady state on VPS), both steps log
"already correct, skipping" and never invoke sudo.  If a new directory
was created by root (e.g. a manual mkdir, volume mount, or restart artefact),
the remediation path triggers automatically — the self-heal property is preserved.

Smoke-tested in Docker (ubuntu:22.04):
  Case 1 (1000:1000 + 775):  chown skipped, chmod skipped ✓
  Case 2 (root-owned subdir): chown triggered ✓
  Case 3 (700 dir perms):     chmod triggered ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 15:44:44 +02:00
Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve
Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
   can be an ISO-8601 string (stability-agent via events.py) or a Unix
   int (node-agent).  The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings,
   silently preventing _save_world() from being called and leaving
   incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and
   ISO-8601 strings uniformly. All timestamp arithmetic now goes through
   it; returns 0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
   and marks the incident "resolved" in memory, but if incidents.json was
   truncated mid-write (pre-atomic-write era), the observer loaded it at
   next startup with status="active" and no service entry pointing to it.
   No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so
   ISO-string resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 14:29:12 +02:00
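To illustrate the `_parse_ts` helper described in commit f5dcefc752 above, here is a minimal sketch of a parser that accepts an int, a float, or an ISO-8601 string and returns 0.0 on None or garbage. The name `parse_ts`, the handling of a trailing `Z`, and the treatment of naive timestamps as UTC are assumptions, not the repository's actual code.

```python
from datetime import datetime, timezone
from typing import Any


def parse_ts(ts: Any) -> float:
    """Return a Unix timestamp as float, or 0.0 if ts cannot be interpreted."""
    if isinstance(ts, bool):
        # bool is a subclass of int; treat it as garbage (assumption).
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        try:
            # datetime.fromisoformat() before Python 3.11 rejects a trailing "Z".
            dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            # Assumption: naive timestamps are interpreted as UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    return 0.0
```

With a helper like this, arithmetic such as `time.time() - parse_ts(last_occurrence)` cannot raise TypeError regardless of which agent wrote the value.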
Oskar Kapala 98437d46b2 test(control-plane): atomic write and resilient loader coverage
11 new test cases in test_state_reliability.py covering:
- atomic_write_json: produces valid JSON, no .tmp left behind, overwrites,
  works with nested structures
- _load_actual_state: returns False on empty / truncated file, returns True
  on valid files, preserves last-known-good state across a parse failure
- reconcile: empty/truncated services.json or incidents.json generates zero
  actions (skip-cycle semantics proven end-to-end)
- healthy service with valid world state generates no spurious action

All 32 tests (11 new + 21 existing) pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:27:05 +02:00
Oskar Kapala 5e97b4e448 fix(supervisor): atomic writes + skip cycle on unreadable world state
Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:

1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
   all bare open('w')+json.dump calls in supervisor and executor, so the
   action-file pipeline is never visible in a half-written state.

2. Resilient loader: _load_actual_state now returns False when any world
   state file fails to parse (empty or truncated mid-write). reconcile()
   skips the entire drift check on False instead of treating {} as "all
   services missing". actual_state retains its last-known-good values so
   a single bad cycle does not wipe accumulated context.

   Before: parse error → raw[key]={} → all desired services missing →
     wall of redeploy actions → drift_resolved_auto churn on next cycle.
   After:  parse error → WARNING logged → cycle skipped → no actions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:59 +02:00
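The resilient loader from commit 5e97b4e448 above can be sketched roughly as follows: parse every world-state file first, and only update the in-memory state if all of them parse. The function signature, the `paths` mapping, and the requirement that each file holds a JSON object are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_actual_state(paths: dict, actual_state: dict) -> bool:
    """Update actual_state from JSON files; return False if any file is unreadable.

    paths maps a state key (e.g. "services") to a pathlib.Path.
    On failure actual_state is left untouched (last-known-good is retained).
    """
    fresh = {}
    for key, path in paths.items():
        try:
            data = json.loads(Path(path).read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Empty or truncated files raise json.JSONDecodeError (a ValueError).
            return False
        if not isinstance(data, dict):
            return False
        fresh[key] = data
    actual_state.update(fresh)
    return True
```

A caller in the spirit of `reconcile()` would log a warning and skip the drift check when this returns False, rather than treating an empty mapping as "all services missing".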
Oskar Kapala ffb0608b9a fix(observer): atomic writes for world state files
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.

Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 12:26:49 +02:00
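The write, fsync, then os.replace sequence named in commit ffb0608b9a above might look like the sketch below. The function name and the `.tmp` suffix follow the commit message; everything else (signature, encoding) is an assumption.

```python
import json
import os


def atomic_write_json(path: str, data) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    tmp = f"{path}.tmp"  # sibling file, so os.replace stays on one filesystem
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic rename over the destination
```

Because the destination only ever changes via the rename, a reader opening the file mid-write gets the previous complete contents instead of a truncated file.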
Oskar Kapala f381023206 docs(claude): add Definition of Done for services (smoke test + pytest)
Lesson from brain-watchdog: code that was never run had a packaging bug
that caused a crash loop in production. New rule: docker build + short
smoke-run + pytest before any commit or deploy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:38:39 +02:00
Oskar Kapala cb4ae756ab test(brain-watchdog): add pytest suite covering import and check() logic
7 cases: package importable, fresh ok, stale, unreachable, HTTP error,
missing last_update field, unparseable timestamp. pytest.ini sets pythonpath=src
so tests run without PYTHONPATH set in the environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:38:24 +02:00
Oskar Kapala cfe5e02372 fix(brain-watchdog): add PYTHONPATH=/app/src so brain_watchdog package is importable
WORKDIR is /app but the package lives under src/; without PYTHONPATH set
`python -m brain_watchdog.main` raised ModuleNotFoundError on startup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 20:31:45 +02:00
Oskar Kapala 039f9f7247 feat(piha): brain-watchdog — external watchdog for control-plane
Polls /summary on VPS over Tailscale every 60s; computes freshness
locally from last_update epoch (never trusts self-reported status).
Alerts via Telegram Bot API directly after 3 consecutive failures;
sends recovery message on heal. State (fail_count, alerted) persisted
to volume so debounce survives restarts.

- services/brain-watchdog/: Python service, no external deps (stdlib only)
- hosts/piha/runtime/brain-watchdog/: override with mem_limit 64m
- hosts/piha/services.yaml + inventory/topology.yaml: manifest entries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 17:54:36 +02:00
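The freshness check and the three-failure debounce described in commit 039f9f7247 above can be sketched as a small state machine. The names `WatchdogState`, `is_fresh`, `step`, and the 180-second `max_age` default are hypothetical; only the threshold of 3 consecutive failures, the `fail_count` and `alerted` fields, and the recovery message come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatchdogState:
    fail_count: int = 0
    alerted: bool = False


def is_fresh(last_update: float, now: float, max_age: float = 180.0) -> bool:
    """Compute freshness locally from the epoch, ignoring self-reported status."""
    return now - last_update <= max_age


def step(state: WatchdogState, ok: bool, threshold: int = 3) -> Optional[str]:
    """Advance the debounce state; return "alert", "recovered", or None."""
    if ok:
        was_alerted = state.alerted
        state.fail_count = 0
        state.alerted = False
        return "recovered" if was_alerted else None
    state.fail_count += 1
    if state.fail_count >= threshold and not state.alerted:
        state.alerted = True
        return "alert"
    return None
```

Persisting `WatchdogState` to the volume between polls is what would let the debounce survive a restart.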
Oskar Kapala 495741e7ac operator-ui: /events without loading the entire directory + daemon threads; epoch from regex (fix chelsty-infra)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 16:34:52 +02:00
Oskar Kapala 43c5d45353 deploy: chmod/chown on /opt/homelab resilient to disappearing event files 2026-06-01 14:35:19 +02:00
Oskar Kapala f64cec645e vps: mem_limit + oom_score_adj on in-repo services; deploy-local applies override (stop OOM) 2026-06-01 14:23:58 +02:00
Oskar Kapala 1db9db7d03 fix(dashboard): read last_update from JSON content, not file mtime
operator_ui.py called .replace() on last_update without checking type —
an integer value (written by the materializer) raised AttributeError and
silently fell back to os.path.getmtime(), which was stuck at 5/29 after a
deploy with preserved timestamps. web.py had the same class of bug but
worse: it unconditionally replaced last_update with mtime, ignoring the
JSON field entirely. Both now branch on isinstance(str) and cast numeric
values directly to float, with mtime only as a last-resort fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-31 22:10:50 +02:00
Oskar Kapala 52607a7cdd feat(control-plane): shadow_mode for HA event auto-actions + deploy docs
- HA_DIAG_SHADOW_MODE env flag in supervisor (default true)
- shadow_mode downgrades container_restart actions to alert_only with
  [SHADOW MODE] note; same action_id and 30-min cooldown apply
- alert_only events unaffected (always routed normally)
- 3 new tests: shadow on/off for ha_websocket_dead, alert-only unaffected
- DEPLOY.md with token gen, per-host config, verification, 48h observation,
  production-mode enablement, rollback
- README.md updated with shadow mode flag summary and DEPLOY.md link

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 17:12:33 +02:00
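The shadow-mode downgrade from commit 52607a7cdd above could be sketched like this. The `HA_DIAG_SHADOW_MODE` flag, its default of true, and the `[SHADOW MODE]` note come from the commit message; the action dict keys (`type`, `note`) and the set of accepted "false" spellings are assumptions.

```python
import os


def shadow_mode_enabled(env=None) -> bool:
    """Read HA_DIAG_SHADOW_MODE; shadow mode is on unless explicitly disabled."""
    env = os.environ if env is None else env
    value = env.get("HA_DIAG_SHADOW_MODE", "true").strip().lower()
    return value not in ("0", "false", "no", "off")


def apply_shadow_mode(action: dict, shadow: bool) -> dict:
    """Downgrade container_restart to alert_only; leave other actions untouched."""
    if not shadow or action.get("type") != "container_restart":
        return action
    downgraded = dict(action)  # copy, so action_id and other fields are preserved
    downgraded["type"] = "alert_only"
    downgraded["note"] = "[SHADOW MODE] would have run container_restart"
    return downgraded
```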
Oskar Kapala b9ed118b8c fix(telegram-bot): correct risk_level field + show description in alerts
- read risk_level with risk fallback (was: risk only → "unknown" for
  all actions written by supervisor which uses risk_level key)
- include description field in alert format (was: alert_only payloads'
  substance was invisible — description carried the full message)
- extract _format_pending_action() pure helper to enable unit testing
  without a live Telegram connection
- 8 tests: risk_level present, risk fallback, both absent, description
  shown/absent, truncation, full HA alert_only shape, no-description no-crash
- flagged during Phase 5 review of ha-diag-agent supervisor routing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 16:26:49 +02:00
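A pure formatting helper of the kind commit b9ed118b8c above describes might look like the following sketch. Only the `risk_level` key, the `risk` fallback, the "unknown" default, and the optional `description` come from the commit message; the output layout, the `type` key, and the 200-character truncation limit are hypothetical.

```python
def format_pending_action(action: dict, max_desc: int = 200) -> str:
    """Build alert text without needing a live Telegram connection."""
    # Prefer risk_level (the supervisor's key), fall back to the older risk key.
    risk = action.get("risk_level") or action.get("risk") or "unknown"
    lines = [f"Action: {action.get('type', 'unknown')}", f"Risk: {risk}"]
    description = action.get("description")
    if description:
        if len(description) > max_desc:
            description = description[:max_desc] + "..."
        lines.append(description)
    return "\n".join(lines)
```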
Oskar Kapala bf1415e4c1 feat(control-plane): route ha-diag-agent events through supervisor
- 8 HA event types mapped to existing action types
- ha_websocket_dead → container_restart (homeassistant), 30-min cooldown
- 6 events → alert_only (entity_unavailable, integration_failed,
  automation_failing, update_available, recorder_lag,
  system_health_degraded), 1-hour cooldown
- ha_websocket_recovered → cancels matching pending container_restart
- state-aware suppression: skip HA events when homeassistant has an
  active containers_not_running incident < 5 min ago (avoids alert
  storms during HA restarts/updates)
- location_tag preserved through action pipeline for per-house
  telegram alerts
- executor: alert_only acknowledged as no-op success
- 18 tests covering all 8 event types, suppression, cooldown,
  dedup, location_tag, recovery cancellation
- CLAUDE.md: supervisor event routing table added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 15:59:23 +02:00
Oskar Kapala 31b48d162a feat(ha-diag-agent): WebSocketMonitor for real-time HA liveness
- persistent WS connection to HA with auth + state_changed subscription
- watchdog detects silence > 5min → emits ha_websocket_dead
- immediate ha_websocket_dead on disconnect, exponential reconnect with jitter
- cooldown prevents alert spam (10min repeat window while HA stays down)
- ha_websocket_recovered emitted on reconnect after a dead alert (allows
  supervisor to clear active incidents in Phase 5)
- new monitors/ subpackage for long-running tasks (vs interval checks/)
- /health endpoint now includes ws_connected field
- 26 unit tests, 3 integration tests (real HA + container stop/restart)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 15:00:18 +02:00
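Commit 31b48d162a above mentions exponential reconnect with jitter without giving the parameters. A minimal sketch, assuming a 1-second base, a 60-second cap, and the "full jitter" variant (a uniform draw between 0 and the exponential ceiling); none of these values are confirmed by the source.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng=random.random) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return ceiling * rng()  # rng() is in [0, 1), which spreads reconnects out
```

The jitter matters when several agents lose the same upstream at once: without it they would all retry on the same schedule.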
Oskar Kapala 3499b2f280 feat(ha-diag-agent): three REST diagnostic checks + Phase 3 flag fixes
New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA
  integrations via /api/system_health snapshot diff; transition-based
  dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available
  events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all
  pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with
  N consecutive failures (default 3) via /api/trace/automation/<id>;
  6h cooldown per automation

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses
  min(state.last_changed, baseline.first_seen) as effective "since",
  giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches
  response in-process with configurable TTL (default 300s); avoids
  repeated API calls across concurrent check cycles; invalidate_registry_cache()
  for manual invalidation

Storage: system_health_snapshot table (component, last_status, last_seen_at,
payload) created automatically on next Storage.open() call

Config additions (all with defaults): entity_registry_cache_ttl=300,
system_health_check_interval=900, automation_check_interval=1800,
automation_failure_threshold=3, updates_check_hour=9,
updates_check_minute=0, updates_cooldown_days=7

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new);
3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:43:10 +02:00
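The transition-based dedup in SystemHealthCheck (commit 3499b2f280 above) boils down to diffing two snapshots. In this sketch the snapshots are plain dicts mapping component name to "ok" or "error", and a component missing from the previous snapshot is treated as "ok"; both are assumptions about the data shape.

```python
def diff_health(previous: dict, current: dict) -> tuple:
    """Return (fired, cleared) component lists for one snapshot transition."""
    fired, cleared = [], []
    for component, status in current.items():
        before = previous.get(component, "ok")
        if before == "ok" and status == "error":
            fired.append(component)      # ok -> error: emit an alert
        elif before == "error" and status == "ok":
            cleared.append(component)    # error -> ok: clear the alert
        # error -> error (sustained) and ok -> ok stay silent
    return fired, cleared
```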
Oskar Kapala f41ec5d0c5 docs: compress CLAUDE.md + fix zigbee2mqtt coordinator docs
- CLAUDE.md: collapsed 5-section deployment block to single annotated
  block, removed inline emit_event signatures (kept path + type list),
  flattened runtime path tree to bullets, condensed node table note to
  reference capabilities.yaml, added CHELSTY docker-compose v1
  constraint; 156 → 113 lines (~750 → ~480 tokens)
- fix: zigbee2mqtt/README.md updated to TCP coordinator (SLZB-06U at
  192.168.1.105:6638, ezsp); removed stale /dev/ttyACM0 USB reference
  and corrected owner node from piha to chelsty-infra

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 14:17:23 +02:00
Oskar Kapala 20f6761a67 feat(ha-diag-agent): UnavailableEntitiesCheck with root cause dedup
- shared aiohttp ClientSession in HAClient (Phase 1 Flag #2 fixed):
  make_session() factory, session injected at startup, closed on shutdown
- Check.run() → list[CheckResult]: clean multi-event interface
- first real diagnostic check: entity unavailable > 24h
  (INSERT OR IGNORE baseline preserves first-seen timestamp)
- root cause grouping: emit ha_integration_failed instead of N entity
  events when ≥50% of integration's entities are unavailable (≥3 min)
- alert deduplication via SQLite cooldown window (default 6h)
- recovery clears baseline + dedup for immediate re-alert
- configurable thresholds: duration, integration %, cooldown
- 38 unit tests + 7 integration tests (42 pass, 3 skip w/o live HA)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 13:41:55 +02:00
Oskar Kapala 07bd498fd6 feat(ha-diag-agent): test environment with dual HA Docker instances
- dockerized ken + chelsty HA test instances with template fixtures
- snapshot/reset/wait scripts for fixture management
- integration test infrastructure with separate marker
- location_tag promoted from metadata to event payload (Phase 1 flag #3)
- chelsty-infra target_url points to chelsty-ha via tailnet (Phase 1 flag #1)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:56:13 +02:00
Oskar Kapala 90c8e77bf7 chore: gitignore *.egg-info, remove committed egg-info
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:26:57 +02:00
Oskar Kapala ab8895d28b feat(ha-diag-agent): scaffold service with HA REST client and event emitter
- new per-host service, follows node-agent pattern
- 7 new HA event types defined (routing in supervisor — Phase 5)
- HeartbeatCheck as pipeline validator (pings /api/, emits ha_websocket_dead)
- service.yaml + host configs for piha (ken) and chelsty-infra (chelsty)
- test scaffolding with aiohttp/aiosqlite mocks (15/15 passing)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 12:26:34 +02:00
Oskar Kapala bd7f955e4e fix+debug(planner-agent): use base_url (not api_base) for litellm.acompletion, add print [TEMP]
litellm.acompletion() has base_url as a named param; api_base only works
via **kwargs fallback path. Switching to base_url ensures the value lands
correctly in completion_kwargs and reaches the ollama provider.

print() added (not logger) so base_url is always visible in docker logs
regardless of log level.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 13:07:58 +02:00
Oskar Kapala 99200e6690 debug(planner-agent): log api_base before each litellm call [TEMP]
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 12:52:11 +02:00
Oskar Kapala dcacac6965 fix(planner-agent): rename OLLAMA_HOST → OLLAMA_API_BASE (litellm convention)
LiteLLM reads OLLAMA_API_BASE, not OLLAMA_HOST.
- llm_router.py: DEFAULT_OLLAMA_HOST → DEFAULT_OLLAMA_API_BASE, param ollama_host → ollama_api_base
- planner.py: env var os.getenv("OLLAMA_HOST") → os.getenv("OLLAMA_API_BASE"), param renamed accordingly
- /opt/homelab/config/planner-agent/.env on SOLARIA updated in-place (not in git)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 11:34:08 +02:00
Oskar Kapala e52b2e2259 fix(planner-agent): remove duplicate ANTHROPIC_API_KEY from environment
Key is already provided via env_file: /opt/homelab/config/planner-agent/.env

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 10:57:08 +02:00
Oskar Kapala 5ccdfa0ca6 docs: add planner-agent docs and session summary 2026-05-27
- services/planner-agent/README.md: full service doc (what it does,
  LLM fallback chain, env vars, deploy steps, local run, redis-cli
  end-to-end test, healthcheck)
- README.md: add Agent System section with all agents and their roles
- docs/sessions/2026-05-27-planner-agent.md: session summary (built
  files, architectural decisions, problems + solutions, deployment
  status, pending work)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 22:35:59 +02:00
Oskar Kapala ff6fda1f04 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
All runtime vars (REDIS_URL, OLLAMA_HOST, OLLAMA_MODEL, NODE_NAME,
COOLDOWN_SECONDS, RUNTIME_PATH) are sourced from the host-local
/opt/homelab/config/planner-agent/.env via env_file.
Only ANTHROPIC_API_KEY stays in environment (not in env_file — secret
injected at runtime by the operator when needed).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 22:27:44 +02:00
Oskar Kapala ca37fca5ce feat(planner-agent): main loop with LLM routing and HITL action proposals
services/planner-agent/src/planner.py:
- PlannerAgent: async Redis pub/sub on health_events + world_updates
- Pipeline: receive event → cooldown gate → LLMRouter → write pending action
  → emit remediation_started filesystem event
- CooldownTracker: 5-min suppression per svc_key (configurable via env)
- parse_event(): accepts node-agent shape A and world_updates shape B
- PROPOSAL_SCHEMA: jsonschema enforced by LLMRouter before accepting response
- SYSTEM_PROMPT: homelab topology + action rules (chelsty always requires_human,
  disk_pressure always notify, confidence<0.7 → requires_human)
- write_pending_action(): atomic tmp→rename write, executor-compatible format
- emit_event(): async wrapper around filesystem event write (no control-plane import)
- _emit_event_sync() reads NODE_NAME at call time (not import) for testability
- Benign events (service_healthy, node_online, ...) silently skipped
- LLM chain failure: no cooldown recorded so next event can retry

services/planner-agent/tests/test_planner.py (49 tests, 0 network):
- TestCooldownTracker: 7 tests (ready/not-ready/elapsed/reset/independence)
- TestHealthEvent, TestActionProposal, TestMapActionToExecutorType
- TestParseEvent: both event shapes, missing fields, timestamp formats
- TestBuildMessages: system prompt rules, payload inclusion
- TestPlannerHandleEvent: benign skip, cooldown block, ignore/restart/redeploy/
  notify proposals, remediation event emission, LLM failure isolation,
  requires_human propagation, cooldown recording, model name in proposal
- TestPlannerDispatch: valid JSON, invalid JSON, non-string data, missing node
- TestWritePendingAction, TestEmitEvent: filesystem integration with tmp_path

services/planner-agent/service.yaml:
  owner_node: solaria, dependencies: [redis, ollama]
services/planner-agent/docker-compose.yml: env + healthcheck
services/planner-agent/Dockerfile: python:3.11-slim
services/planner-agent/healthcheck.sh: heartbeat file age check (300s)
services/planner-agent/requirements.txt: litellm, redis, jsonschema, structlog

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 19:11:39 +02:00
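The per-service cooldown gate from commit ca37fca5ce above can be sketched as below. The class name and the 5-minute default come from the commit message; the method names and the injectable clock are hypothetical, chosen so the behaviour can be tested without sleeping.

```python
import time


class CooldownTracker:
    """Suppress repeated handling of the same key within a time window."""

    def __init__(self, seconds: float = 300.0, clock=time.monotonic):
        self._seconds = seconds
        self._clock = clock
        self._last = {}

    def ready(self, key: str) -> bool:
        last = self._last.get(key)
        return last is None or self._clock() - last >= self._seconds

    def record(self, key: str) -> None:
        self._last[key] = self._clock()

    def reset(self, key: str) -> None:
        self._last.pop(key, None)
```

Calling `record()` only after a successful proposal matches the commit's note that an LLM chain failure records no cooldown, so the next event can retry.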
Oskar Kapala 1bbc511bb7 feat(planner-agent): add llm_router.py with local-first fallback chain
services/planner-agent/src/llm_router.py:
- LLMRouter: async routing via litellm; chain = Qwen/Ollama → haiku → sonnet
- Timeouts: 8s local, 30s cloud; asyncio.wait_for belt-and-suspenders
- Rejection triggers: timeout, API error, refusal patterns, JSON schema fail
- JSON fence extraction: recovers valid JSON from fenced code blocks
- ModelMetrics: per-model success/fallback/error counters + success_rate()
- Redis publish to 'llm_router_metrics' after every call (failure-safe)
- redis_url=None disables Redis (useful in tests / edge nodes)
- context= param adds caller label to all log lines for tracing

services/planner-agent/tests/test_llm_router.py:
- 34 tests, 0 network calls (litellm + Redis fully mocked)
- Covers: primary success, JSON error fallback, refusal fallback,
  timeout fallback, API exception fallback, all-fail RuntimeError,
  schema validation, fence extraction, metrics recording, Redis publish,
  Redis failure isolation

services/planner-agent/requirements.txt:
- litellm>=1.40.0, redis>=5.0.0, jsonschema>=4.21.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 18:38:06 +02:00
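The JSON fence extraction listed in commit 1bbc511bb7 above might work along these lines: try the raw response first, then look inside fenced blocks. The function name, the regular expression, and the choice to return the first block that parses are assumptions, not the router's actual code.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built at runtime to keep this listing tidy
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE,
                     re.DOTALL | re.IGNORECASE)


def extract_json(text: str):
    """Parse text as JSON, falling back to the first fenced block that parses."""
    try:
        return json.loads(text)
    except ValueError:
        pass
    for match in _FENCED.finditer(text):
        try:
            return json.loads(match.group(1))
        except ValueError:
            continue
    raise ValueError("no valid JSON found in model response")
```

In a fallback chain, the final ValueError would count as a rejection and move the request on to the next model.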
Oskar Kapala 603e10a364 docs: session summary 2026-05-27 + update observer/control-plane/chelsty docs
docs/sessions/2026-05-27.md (new):
- Full session record: problems found, all commits shipped, end state
- Written in Polish per operator preference for session notes
- Known limitations: SLZB-06U offline, ezsp→ember migration pending

docs/observer-runtime.md:
- Document per-node checkpoint format (replaces old global checkpoint)
- Add service_healthy / service_recovered resolution behavior
- Document ghost key pruning (_prune_stale_world patterns)
- Add event type reference table (negative vs positive)

docs/vps-control-plane.md:
- Add container names and network_mode: host detail
- Document monitor:false, NODE_ALIAS_MAP, auto-cancel behavior
- Add piha agent-system materializer integration note
- Rewrite recovery section with actionable bootstrap-flood diagnosis
- Add action state machine (pending→approved→running→completed/cancelled)

docs/chelsty-runtime.md:
- Add chelsty-infra/chelsty-ha node table
- Document docker-compose v1 constraint (always use docker-compose, not docker compose)
- Add mosquitto network_mode:host + z2m extra_hosts:host-gateway explanation
- Add z2m config writable requirement (EROFS failure mode documented)
- Add chelsty-ha monitor:false rationale
- Add minimal configuration.yaml template for z2m

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:18:31 +02:00
Oskar Kapala 7277bdc27f Fix Copy for AI: materializer fetches from control-plane API instead of Redis
services/agent-system/runtime-materializer/materializer.py:
- Add materialize_from_api() that fetches all world-state endpoints
  from the control-plane HTTP API (CONTROL_PLANE_URL env var)
- When CONTROL_PLANE_URL is set, use API as source of truth instead of Redis
- Redis path preserved as fallback for backward compat

hosts/piha/runtime/agent-system/docker-compose.override.yml (new):
- Inject CONTROL_PLANE_URL=http://100.95.58.48:18180 for runtime-materializer
- piha webui /snapshot now mirrors VPS observer output (clean, ghost-free)

Root cause: materializer read from Redis which held 80 stale service entries
with hash-prefixed ghost keys (e.g. 0ccb8a88e079_control-plane-supervisor).
Redis is never updated by the current observer pipeline; the control-plane API
is the single authoritative world-state source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 16:07:51 +02:00
Oskar Kapala b40b832159 Fix ghost service keys from hash-prefixed Docker container names
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
  falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)

observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage

control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:41:13 +02:00
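Both halves of commit b40b832159 above rest on the same pattern, a 12-character hex prefix followed by an underscore. A minimal sketch, assuming the prefix is lowercase hex and that service keys have the form `<node>/<name>`; the function signatures are hypothetical and take plain strings and dicts rather than Docker SDK objects.

```python
import re

_HASH_PREFIX = re.compile(r"^[0-9a-f]{12}_")


def canonical_container_name(name: str, labels: dict) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix."""
    service = labels.get("com.docker.compose.service")
    if service:
        return service
    return _HASH_PREFIX.sub("", name)


def is_ghost_key(key: str) -> bool:
    """True for service keys shaped like <node>/<12hex>_<name>."""
    _node, sep, service = key.partition("/")
    return bool(sep) and _HASH_PREFIX.match(service) is not None
```

The first function prevents new ghost keys at the source; the second is the kind of predicate a prune pass could use to drop entries already on disk.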
Oskar Kapala 28e9534765 observer: service_healthy resolves active incidents
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.

Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:20:19 +02:00
Oskar Kapala 46ae92b5c1 supervisor: also cancel pending actions for services removed from desired state
Previously _cancel_resolved_pending_actions() only cancelled actions where
the service became healthy. This left orphaned actions when a service was
removed from services.yaml or marked monitor:false.

Add Case 1: if the action's svc_key is no longer in desired_state (either
removed entirely or skipped due to monitor:false), cancel with reason
service_removed_from_desired_state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:19:13 +02:00
Oskar Kapala 410bfe7065 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
z2m migrates configuration.yaml on startup and needs write access.
Remove the separate :ro config mount; rely on the base compose's
/opt/homelab/data/zigbee2mqtt/data:/app/data read-write mount instead.
configuration.yaml must exist at that path on the node before first run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:13:33 +02:00
Oskar Kapala b3912fe0ce zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
docker-compose v1 cannot clear the ports list from the base compose with
ports: [] in an override, so network_mode: host caused InvalidArgument.

Use extra_hosts with host-gateway instead: maps 'mosquitto' hostname to the
Docker bridge gateway IP so mqtt://mosquitto:1883 reaches the host-networked
mosquitto process from within the bridge-networked z2m container.
Requires Docker 20.10+ (present on chelsty-infra).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:12:33 +02:00
Oskar Kapala 61e07f4318 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
docker-compose v1 (1.29.2 on chelsty-infra) raises InvalidArgument when
network_mode: host is combined with port_bindings from the base compose file.
Add ports: [] in the override to clear the base ports list.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:11:42 +02:00
Oskar Kapala 51002d4502 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
node_exporter (new service):
- Add services/node_exporter/docker-compose.yml matching solaria deployment
  (network_mode: host, pid: host, /:/host:ro,rslave mount)
- Add services/node_exporter/service.yaml

zigbee2mqtt chelsty-infra override:
- Fix network_mode: host (mosquitto runs on host network, port 1883 on localhost)
- Fix volume mount: ./configuration.yaml → absolute /opt/homelab/config/zigbee2mqtt/
  (secrets stay in runtime config dir, never in Git)
- Remove MQTT_USER/MQTT_PASSWORD (mosquitto uses allow_anonymous true)
- Extend healthcheck start_period to 60s (z2m takes time on first start)

chelsty-ha/services.yaml:
- Remove node-agent entry entirely (never deployed, no plans to bootstrap now)
- Keep homeassistant with monitor: false (no node-agent = no health events)

supervisor: respect monitor: false in services.yaml
- Skip action generation for services where monitor=false
- Cleans up chelsty-ha entries from action queue without removing desired-state docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 15:10:48 +02:00
Oskar Kapala fb7828b52b supervisor: auto-cancel pending actions when drift is resolved
When a service becomes healthy (node-agent emits service_healthy → observer
updates services.json), any previously queued redeploy/container_restart
action is stale. Without cleanup, the queue accumulates old actions that
require manual rejection.

_cancel_resolved_pending_actions() runs after each reconcile cycle:
- Reads all pending/*.json with type=redeploy or container_restart
- If the service is now healthy in actual_state, moves action to cancelled/
  with reason=drift_resolved_auto
- Only pending actions are touched; approved/running are left to the operator

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:58:55 +02:00
Oskar Kapala 2f1965733f fix(node-agent): unique event IDs per service to prevent same-second overwrites
Multiple service_healthy (or containers_not_running) events emitted in the
same second for different containers shared the same filename pattern
evt-{node}-{ts}-{type}.json — the second write silently overwrote the first,
so the observer only ever saw the last container checked per event type per cycle.

Fix: include a sanitized service name slug in the ID so every event gets a
unique file, e.g. evt-vps-1234-service_healthy-node-agent.json.

Also adds import re (required for re.sub in the slug generation).
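
A minimal sketch of the ID scheme described above, assuming a simple `re.sub`-based sanitizer; the function name `event_id` and the exact character class are illustrative and not taken from `node_agent.py`:

```python
import re


def event_id(node: str, ts: int, event_type: str, service: str) -> str:
    """Build an event ID that stays unique per service within the same second."""
    # Hypothetical sanitizer: lowercase, then collapse any run of characters
    # outside [a-z0-9_-] into a single '-'.
    slug = re.sub(r"[^a-z0-9_-]+", "-", service.lower()).strip("-")
    return f"evt-{node}-{ts}-{event_type}-{slug}"


# Two containers checked in the same second now map to different files.
print(event_id("vps", 1234, "service_healthy", "node-agent") + ".json")
print(event_id("vps", 1234, "service_healthy", "observer") + ".json")
```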

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:55:22 +02:00
Oskar Kapala 267742c7d7 vps/node-agent: add network_mode: host for control-plane health probe
The _check_control_plane_health() method probes localhost:18180, which
is the control-plane's mapped port. Inside a bridged container, localhost
resolves to the container's own loopback — the probe always fails.

host network mode shares the VPS host's network namespace so that
localhost:18180 correctly reaches the control-plane.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:52:32 +02:00
Oskar Kapala 4e8968f9c7 Fix service health tracking: emit service_healthy, control-plane endpoint check, cleanup checkpoint migration
- node_agent: emit service_healthy for all running managed containers so
  observer populates services.json (previously empty → supervisor flooded
  action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
  endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
  service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
  format from observer checkpoint (was reading old last_processed_file key,
  always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
  without resolving incidents (unlike service_recovered which also resolves)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:49:56 +02:00
Oskar Kapala f4a8db93e4 fix(observer): per-node-directory checkpoints replace single global checkpoint
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events.  Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/).  Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.

New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
  the node dir from the path on first load
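
The per-node cursor logic above could look roughly like this. It is a sketch under assumptions: the helper names `new_events` and `migrate` are hypothetical, and event paths are assumed to have the shape `events/<node>/<file>.json`.

```python
from pathlib import PurePosixPath


def migrate(checkpoint: dict) -> dict:
    """Convert the old single-cursor checkpoint to per-node cursors."""
    if "node_checkpoints" in checkpoint:
        return checkpoint
    old = checkpoint.get("last_processed_file")
    if not old:
        return {"node_checkpoints": {}}
    # The node directory is the parent folder of the last processed file.
    return {"node_checkpoints": {PurePosixPath(old).parent.name: old}}


def new_events(paths, node_checkpoints: dict) -> list:
    """Return paths that sort after the cursor of their own node directory."""
    fresh = []
    for path in sorted(paths):
        node_dir = PurePosixPath(path).parent.name
        # A node without a cursor yet has every file treated as new.
        if path > node_checkpoints.get(node_dir, ""):
            fresh.append(path)
    return fresh
```

With a cursor stored only for `vps`, files under `events/piha/` are still picked up, which is the failure mode the single global checkpoint could not handle.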

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:16:58 +02:00
Oskar Kapala a5a3e223dc fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
When ~/.ssh is mounted from the host oskar user into a container that
runs as root, OpenSSH rejects ~/.ssh/config with 'Bad owner or
permissions' because the file UID doesn't match the running process.

Add -F /dev/null to the rsync SSH command to skip the config file
entirely.  Also add UserKnownHostsFile=/dev/null so no known_hosts
write is attempted into a potentially read-only mounted .ssh dir.
The key itself (/root/.ssh/id_rsa) is still read as an implicit
default identity and is not affected by -F.

Reproduces on chelsty-infra (has ~/.ssh/config); safe for all nodes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:12:19 +02:00
Oskar Kapala 2349de518b fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
100.108.208.3 is piha's Tailscale IP (piha hosts Forgejo+Redis).
VPS's actual Tailscale IP is 100.95.58.48.  All three node-agent
overrides were pointing at piha itself, causing containers to SSH
to their own host and fail auth.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 14:07:27 +02:00
Oskar Kapala 65bac4ebfe fix(node-agent): mount host SSH key into container for event shipping
Nodes ship events to VPS via rsync+SSH. The container runs as root
and uses the default SSH identity, which must be at /root/.ssh/.
Mount /home/oskar/.ssh from the host read-only so the existing
authorized key is available inside the container.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:59:28 +02:00
Oskar Kapala 96bf32614f fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
  Inside a Docker container this returns the 12-char container ID (e.g.
  'be17cb6eb0f6'), not the host name.  Observer ingested those events and
  created ghost entries in world/nodes.json that never expired.

observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
  from topology inventory; called on every run_once() cycle (both new-events
  and idle paths).  Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
  so the Dashboard's System Overview cards show real numbers instead of undefined.

operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
  as keyed dicts; the frontend calls .map() which requires an array.  All four
  functions now convert the dict to a properly-shaped list.  Each item has the
  fields the Nodes, Services, Topology, Deployments, and Correlation views expect
  (hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
  service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
  default 24).  Without this, every event file ever written was returned,
  including events from ghost-node deploys.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:51:03 +02:00
Oskar Kapala ae33cce889 feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
- piha: NODE_TYPE=sd_card (rate-limited docker prune, once per day)
- solaria: NODE_TYPE=ai_node (dangling+containers+build cache; never -a to preserve Ollama images)
- chelsty-infra: NODE_TYPE=lte_node (NO cleanup, events-only)
- All three: VPS_EVENTS_HOST set for event shipping via rsync+SSH

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:34:23 +02:00
Oskar Kapala c5c080b3e3 feat(vps): add node-agent runtime override with NODE_NAME=vps
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:18:19 +02:00
Oskar Kapala 01b7758fe6 feat(node-agent): implement health monitor and safe cleanup policy
scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
    lte_node  (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
    sd_card   (piha, saturn): dangling images + containers, rate-limited once/24h
    ai_node   (solaria): dangling + containers + build cache, NEVER -a
    standard  (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
  events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
  containers_not_running, healthcheck_failed)

services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
  for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only

observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation

supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
  checks all active states, risk=guarded (operator approval required)

executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
  /opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /

hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 13:15:06 +02:00
Oskar Kapala 7742bda245 feat(control-plane): add container_restart remediation
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 12:50:46 +02:00
oskar 98fe1f1846 fix: frigate config not read-only, mount from /opt/homelab 2026-05-22 11:31:31 +02:00
oskar beb8b5cbaa fix: remove --pull always flag incompatible with docker-compose v1 2026-05-21 22:07:49 +02:00
oskar 898deda05f fix: deploy-frigate.sh use docker-compose v1 for chelsty-infra 2026-05-21 22:05:43 +02:00
oskar f34399a30d feat: add Frigate NVR deployment for chelsty-infra
VAAPI decode via Intel UHD 630, CPU detection, 2x Reolink RLC-540
placeholders. MQTT to local mosquitto (127.0.0.1), 7-day recording
retention. Secrets in /opt/homelab/config/frigate/frigate.env on node.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 18:19:45 +02:00
oskar 9b39581b53 fix(supervisor): content-based action IDs to prevent 30s backlog accumulation
Timestamp in reconcile-{ts}-{node}-{service} meant dedup guard never fired.
Switch to reconcile-{node}-{service} and check pending/approved/running states.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 17:47:37 +02:00
oskar ae7446a04b feat: add Copy for AI snapshot button to webui
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 12:05:37 +02:00
oskar f21be4f4d4 ops: align vps desired state with control-plane architecture, remove legacy agent-system references
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-21 11:40:55 +02:00
oskar 8fb4d3d634 docs: add tech-debt.md, forgejo_runner temp disabled 2026-05-21 10:37:42 +02:00
oskar 35e57cc789 docs(CLAUDE.md): update node model and override path convention
- split CHELSTY into CHELSTY-INFRA and CHELSTY-HA in node roles table
- correct docker-compose override path to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 15:27:46 +02:00
oskar b02c8bb50e fix(deploy): inventory-aware orchestration and correct override paths
- orchestrate-deploy.sh: read nodes from inventory/topology.yaml instead of hardcoded list
- orchestrate-deploy.sh: LTE nodes (chelsty-infra, chelsty-ha) use ConnectTimeout=30, non-fatal on failure
- deploy-node.sh: service discovery falls back to services.yaml if no services.txt
- deploy-node.sh: override path corrected to hosts/<node>/runtime/<service>/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:50:01 +02:00
oskar dc483ae31a docs(chelsty): update docs and topology for site/node split
- chelsty-runtime.md: references chelsty-infra and chelsty-ha nodes
- chelsty-stability-agent.md: scoped to chelsty-infra
- topology.yaml: chelsty monolith replaced with chelsty-infra + chelsty-ha
2026-05-20 14:23:57 +02:00
oskar 9d2f748557 refactor(hosts): split chelsty monolith into chelsty-ha and chelsty-infra
- remove legacy hosts/chelsty/ monolith
- chelsty-infra: add capabilities, networking, paths, runtime (mosquitto, zigbee2mqtt, stability-agent)
- chelsty-ha: add capabilities
- align with site/node model

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-20 14:20:49 +02:00
oskar 8a12b7ff17 docs: expand documentation with AI agents in mind
Co-authored-by: Junie <junie@jetbrains.com>
2026-05-20 12:06:23 +02:00
oskar f65698925e Fix control plane SSH deploy TTY 2026-05-18 21:41:47 +02:00
oskar 9f20dcae05 Add control plane deploy script and fix UI healthcheck 2026-05-18 21:34:57 +02:00
oskar b7251ac416 Fix control plane UI healthcheck 2026-05-18 21:29:55 +02:00
oskar 807b097eb4 Fix Telegram bot job queue dependency 2026-05-18 20:22:12 +02:00
oskar 5754994f8e Refactor Telegram bot to use control plane API 2026-05-17 23:42:52 +02:00
oskar c299a2cb85 Fix agent fleet verification via Redis container 2026-05-17 23:00:51 +02:00
oskar b129f03837 Fix stability agent fleet deploy scripts 2026-05-17 21:09:06 +02:00
oskar b7faac00c5 Add executable stability agent fleet deploy scripts 2026-05-17 17:32:10 +02:00
oskar 8f305ba3df Merge VPS control plane deployment and observer runtime 2026-05-17 17:30:04 +02:00
oskar c9ddfa9ac1 Roll out stability agent to homelab nodes 2026-05-17 15:54:19 +02:00
oskar 3233cf07cd Add Telegram approval bot for agent actions 2026-05-16 21:53:06 +02:00
oskar ac90acfac8 Merge Agent System UI runtime pipeline 2026-05-16 21:38:48 +02:00
oskar 12a775c834 Finish repo-first implementation of Agent System UI pipeline
Co-authored-by: Junie <junie@jetbrains.com>
2026-05-16 19:36:43 +02:00
oskar 41c05f42b5 Add agent system service with Redis materializer 2026-05-15 23:29:59 +02:00
oskar e8d6d6d473 Publish stability agent state to Redis 2026-05-15 22:52:12 +02:00
oskar 8d0f2379ba Add CHELSTY stability agent 2026-05-15 18:51:45 +02:00
oskar 90b2a5d0e9 Add Zigbee coordinator backup 2026-05-14 18:24:26 +02:00
oskar b726048d41 Adapt zigbee2mqtt for SLZB coordinator 2026-05-14 16:37:18 +02:00
Oskar Kapala 533b8e846d Add heartbeat updates and improve health checks in control-plane components 2026-05-12 20:59:46 +02:00
Oskar Kapala f4e6871d76 Add health check to control-plane Dockerfile fix syntax 2026-05-12 20:28:13 +02:00
Oskar Kapala 793559a4b5 Add health check to control-plane Dockerfile 2026-05-12 20:25:01 +02:00
Oskar Kapala 0cf1106b34 Update control-plane port mapping to 18180 2026-05-12 20:22:46 +02:00
Oskar Kapala 2029457f57 Implement VPS control-plane deployment profile 2026-05-12 20:19:05 +02:00
176 changed files with 18548 additions and 530 deletions


@ -0,0 +1,43 @@
---
name: deploy
description: Deploy, redeploy, or ship homelab services to a target node. Trigger on any request containing deploy / redeploy / wdróż / zredeployuj / ship for targets control-plane, vps, piha, solaria, or chelsty-infra.
---
Always invoke `scripts/deploy/deploy.sh <target> [--dry-run] [--no-gate]` as the **sole entry point**.
Never call `deploy-control-plane.sh`, `deploy-node.sh`, or `deploy-local.sh` directly.
## Targets
| Target | What it deploys |
|---|---|
| `control-plane` | observer, supervisor, executor, operator-ui on VPS |
| `vps` | all VPS GitOps services (node-agent, npm, outline, joplin, ai-cluster, …) |
| `piha` | PIHA services (ha-diag-agent, node-agent, redis, …) |
| `solaria` | SOLARIA compute services |
| `chelsty-infra` | CHELSTY LTE edge node (30 s SSH timeout) |
## Invocation
```bash
scripts/deploy/deploy.sh <target> # full pipeline
scripts/deploy/deploy.sh <target> --dry-run # preflight + gate only
scripts/deploy/deploy.sh <target> --no-gate # emergency: bypass tests
```
## Exit Code Handling
| Code | Meaning | Required action |
|---|---|---|
| 0 | Success | Report: target, commit hash, gate status, verify status, elapsed time |
| 1 | Preflight failed | Fix the upstream issue (push commits, wake node, switch to master). Never bypass. |
| 2 | Gate failed | Show exactly which test/build failed. Do **not** deploy. Fix the failure first. |
| 3 | Execute failed | Show full deploy output. Ask user whether to investigate or rollback. |
| 4 | Verify failed | Show docker ps output. Discuss rollback with the user. |
| 5 | Sudo handoff | Print the exact manual command from stderr **verbatim** and stop. User must run it. |
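As an illustration of the exit-code contract above, a wrapper might map each code to its required action. This is a sketch only: `required_action` and `run_deploy` are hypothetical helpers, and the fallback for codes outside 0-5 is an assumption, since the table does not define one.
```python
import subprocess

# Required action per exit code, paraphrased from the table above.
REQUIRED_ACTION = {
    0: "report target, commit hash, gate status, verify status, elapsed time",
    1: "fix the upstream issue; never bypass preflight",
    2: "show the failing test or build; do not deploy",
    3: "show full deploy output; ask whether to investigate or roll back",
    4: "show docker ps output; discuss rollback",
    5: "print the manual command from stderr verbatim and stop",
}


def required_action(code: int) -> str:
    # Assumption: any code not in the table is treated as a hard stop.
    return REQUIRED_ACTION.get(code, "unexpected exit code: show output and stop")


def run_deploy(target: str, dry_run: bool = False):
    """Invoke the sole entry point; --no-gate is deliberately not offered here."""
    cmd = ["scripts/deploy/deploy.sh", target]
    if dry_run:
        cmd.append("--dry-run")
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, required_action(proc.returncode), proc.stderr
```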
## Rules
- Never pass `--no-gate` unless the user explicitly requests emergency/bypass mode.
- Never deploy uncommitted or unpushed code — preflight enforces this; do not help circumvent it.
- Canonical branch is `master` — preflight enforces this.
- For exit 5: reproduce the handoff command exactly as printed to stderr, then stop.


@ -0,0 +1,65 @@
---
name: save-session
description: Save and record the current work session to docs/sessions/. Trigger ONLY on explicit "save session", "zapisz sesję", or "wrap up" — never invoke proactively between tasks.
---
**Trigger condition**: user explicitly says "save session", "zapisz sesję", "wrap up", or equivalent.
Never invoke proactively. Never invoke mid-task.
## 1. Determine Session Boundary
1. Read the latest entry file in `docs/sessions/` — use its last `## Session HH:MM` heading timestamp as the start boundary.
2. Fallback if no previous entry exists: 24 hours ago.
## 2. Collect Facts (deterministic only — no invention)
Run exactly:
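```bash
# Resolve the boundary timestamp to the last commit at or before it
base=$(git rev-list -1 --before="<boundary>" HEAD)
# All commits since boundary
git --no-pager log --oneline "$base"..HEAD
# Changed file summary
git --no-pager diff --stat "$base"..HEAD
```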
From the visible conversation transcript: deploys run and their outcomes, test results seen.
## 3. Write the Session Entry
**APPEND** to `docs/sessions/YYYY-MM-DD.md` (create the file if it doesn't exist for today).
Never overwrite existing content.
```markdown
## Session HH:MM
### Commits
<output of git log --oneline>
### Files changed
<output of git diff --stat>
### Deploys
<list from transcript, or "None recorded">
### Narrative
> _user-provided summary_
```
The `> _user-provided summary_` placeholder is **mandatory**. Never fill it in. The user supplies the narrative separately if desired.
## 4. What NOT to Touch
- `backlog.md` — only on explicit "update backlog" instruction
- `CLAUDE.md` — only on explicit "update CLAUDE.md" instruction
- Any other file not listed above
## 5. Commit
Stage and commit **only** the session file:
```bash
git add docs/sessions/YYYY-MM-DD.md
git commit -m "docs: session YYYY-MM-DD HH:MM"
```
No other files. No `git add -A`.


@ -0,0 +1,81 @@
---
name: worktree-aware
description: >
Use when working in a git worktree checkout for a parallel agent task.
The presence of an .agent-task file in the current working directory indicates
a task worktree (NOT the main checkout). Encodes branch hygiene: commit only
to the assigned task branch, NEVER push origin master, NEVER touch the main
checkout at ~/homelab-codex-ws, NEVER manage worktrees yourself. On task
completion, report the branch name verbatim and stop — the human merges via
scripts/dev/agent.sh.
---
## When this applies
- `.agent-task` present in your `cwd` → you are in a task worktree. Apply all rules below.
- `.agent-task` absent → you are in the main checkout. Do NOT treat yourself as a task agent.
In the main checkout these rules do not apply.
## Reading the marker
`.agent-task` is a YAML file. Your assigned branch is the value of the `branch:` key, e.g.:
```yaml
task: my-feature
branch: task/my-feature
parent_commit: abc1234
created_utc: 2026-06-03T10:00:00Z
worktree_path: /home/oskar/homelab-codex-ws-my-feature
```
Always read this file first before taking any action.
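A minimal sketch of reading the marker, assuming the file stays a flat list of `key: value` pairs as in the example above (nested YAML would need a real parser). The helper names are hypothetical.
```python
from pathlib import Path


def parse_agent_task(text: str) -> dict:
    """Parse flat 'key: value' lines; split on the first colon only."""
    fields = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_agent_task(cwd: str = "."):
    """Return the marker fields, or None when cwd is not a task worktree."""
    marker = Path(cwd) / ".agent-task"
    if not marker.is_file():
        return None
    return parse_agent_task(marker.read_text(encoding="utf-8"))
```
Splitting on the first colon keeps values such as the `created_utc` timestamp intact. A `None` result corresponds to the main-checkout case, where the task rules do not apply.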
## Rules
1. **Commit only to your branch.**
Before any `git commit`, run `git status` and confirm it says `On branch task/<name>`.
If it does not, stop immediately and report the discrepancy.
2. **Push only to your branch.**
The only permitted push is `git push origin task/<name>`.
NEVER `git push origin master` or any other branch.
3. **Do not touch the main checkout.**
`~/homelab-codex-ws/` is the main checkout — deploy-only, owned by the human.
Do not read from, write to, or execute commands inside it.
4. **Stay scoped.**
Only change files directly related to your assigned task.
If you notice other problems, report them in your final summary as separate follow-up proposals.
Do not fix them in this worktree.
5. **Never `git add -A`.**
Always stage specific files by name: `git add path/to/file`.
6. **Do not manage worktrees.**
Never run `git worktree add/remove` or invoke `scripts/dev/agent.sh`.
Worktree lifecycle is the human's responsibility.
7. **Final report before stopping.**
When the task is done, provide a structured report containing:
- Files changed (path and one-line summary of change)
- Tests run and results
- All commit hashes on the task branch
- **Branch name verbatim** (copy-paste ready)
- Follow-up items as bulleted proposals for separate tasks
## Definition of Done
- All commits are on `task/<name>` (verify with `git log --oneline master..task/<name>`)
- Test suite passes
- Branch pushed: `git push origin task/<name>`
- Full report delivered in conversation
## What you do NOT do
- Merge branches
- Create or push tags
- Run deploys or healthchecks against production nodes
- Delete branches or worktrees
- Modify files in other worktrees
- Push to `origin master` under any circumstances

.gitignore

@ -15,6 +15,7 @@ __pycache__/
 *$py.class
 venv/
 .venv/
+*.egg-info/
 # Tools
 .aider*

CLAUDE.md

@ -0,0 +1,194 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Repo Is
GitOps-lite orchestration for a distributed homelab. The repo is the source of truth for infrastructure definitions; runtime state lives at `/opt/homelab/` on each execution node and is never committed.
## Node Roles
| Host | Role |
|------|------|
| **SATURN** | Primary control node — only node where commits are made |
| **SOLARIA** | GPU/compute/AI workloads |
| **PIHA** | Infra, monitoring |
| **VPS** | Public ingress, reverse proxy, control plane host |
| **CHELSTY-INFRA** | LTE edge hypervisor (site: chelsty); Zigbee2MQTT, Mosquitto, stability-agent — offline-first |
| **CHELSTY-HA** | LTE Home Assistant VM (site: chelsty); connects to CHELSTY-INFRA MQTT broker — offline-first |
All nodes communicate over Tailscale. CHELSTY-INFRA and CHELSTY-HA have an intermittent LTE uplink; their services must never depend on SATURN, VPS, or Forgejo at runtime. Full node capabilities: `hosts/<node>/capabilities.yaml`.
## Deployment
```bash
scripts/deploy/deploy.sh # fresh deploy on current node
scripts/deploy/deploy.sh --resume # resume after interruption
scripts/deploy/deploy.sh --stage verify # specific stage only
scripts/deploy/deploy.sh --service mosquitto # specific service only
./scripts/deploy/deploy-control-plane.sh --ssh # SATURN/SOLARIA → VPS
./scripts/deploy/deploy-node.sh chelsty-infra # CHELSTY nodes (individually)
./scripts/bootstrap/prepare-node.sh # general node bootstrap
./scripts/bootstrap/chelsty-runtime.sh # CHELSTY-specific bootstrap
```
Pipeline stages: **prepare → validate → deploy → verify → diagnose (on failure) → complete**. Stage state persisted in `/opt/homelab/state/deploy/`.
## Service Structure
Every service must follow this layout:
```
services/<service>/
├── docker-compose.yml
├── service.yaml # Machine-readable contract (primary source of truth for agents)
├── README.md
├── env.example # Template — never commit actual secrets
└── healthcheck.sh # Returns 0 (healthy) or 1 (unhealthy)
```
`service.yaml` defines `owner_node`, `exposure`, `dependencies`, `healthcheck`, `restart_policy`, `persistence.paths`, and `runtime.env_vars`. This is what AI agents read to understand how to manage a service.
Host-specific runtime config and secrets live at `/opt/homelab/config/<service>/` on the target node (not in Git). Docker Compose overrides are version-controlled at `hosts/<node>/runtime/<service>/docker-compose.override.yml` in this repo and applied during deployment.
## Agent System Architecture
The platform uses a multi-agent model with **human-in-the-loop** for destructive actions:
1. **Stability Agent** (`services/stability-agent/`) — Per-node watchdog. Monitors Docker containers, disk, Tailscale, MQTT. Emits filesystem events. Does NOT restart services autonomously.
2. **Observer** (`services/control-plane/src/`) — Synthesizes world state from events into `/opt/homelab/world/{nodes,services,deployments,incidents}.json`.
3. **Supervisor** — Detects drift between desired state (from `hosts/*/services.yaml`) and actual state (from Observer output). Writes `pending` action JSON files.
4. **Executor** — Executes actions only after they transition to `approved`.
5. **Operator UI** + **Telegram Bot** — Operators review and approve/reject pending actions.
### Action approval flow
```
Agent → /opt/homelab/actions/pending/<id>.json
→ Telegram notification → Operator approves
→ /opt/homelab/actions/approved/<id>.json
→ Executor runs → completed / failed
```
Agents must never execute destructive actions (restarts, deploys, config changes) without a corresponding approved action file.
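The flow above can be sketched as a file move between state directories. This is an illustration under assumptions, not the executor's actual code: `approve` and `may_execute` are hypothetical names, and only the `pending/` and `approved/` directories from the diagram are used.
```python
from pathlib import Path

ACTIONS_ROOT = Path("/opt/homelab/actions")


def approve(action_id: str, root: Path = ACTIONS_ROOT) -> Path:
    """Move an action from pending/ to approved/ (operator decision)."""
    src = root / "pending" / f"{action_id}.json"
    dst = root / "approved" / f"{action_id}.json"
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Path.replace is a rename, so the file is never visible in both states.
    src.replace(dst)
    return dst


def may_execute(action_id: str, root: Path = ACTIONS_ROOT) -> bool:
    """An action is runnable only if its approved file exists."""
    return (root / "approved" / f"{action_id}.json").is_file()
```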
## Event System
Events are append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/<node>/events.jsonl`.
Emit via `scripts/lib/events.sh` (shell) or `scripts/lib/events.py` (Python).
Normalized event types: `deployment_started/completed/failed`, `service_unhealthy/recovered`, `node_offline/online`, `healthcheck_failed`, `remediation_started/completed`.
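For orientation, appending one event as a JSON line might look like the sketch below. The function `emit_event` and the record fields (`ts`, `node`, `type`, `payload`) are assumptions for illustration; the real interface is whatever `scripts/lib/events.py` exposes.
```python
import json
from datetime import datetime, timezone
from pathlib import Path


def emit_event(node, event_type, payload=None,
               root=Path("/opt/homelab/events"), now=None):
    """Append one JSON line to <root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    path = Path(root) / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": now.isoformat(),
        "node": node,
        "type": event_type,
        "payload": payload or {},
    }
    # Mode "a" keeps the store append-only: earlier lines are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path
```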
### Supervisor event routing table
| Event type | Source | Action generated | Cooldown |
|---|---|---|---|
| `containers_not_running` | stability-agent | `container_restart` | dedup via stable ID |
| `mqtt_unreachable` | stability-agent | `container_restart` | dedup via stable ID |
| `service_unhealthy` / other | stability-agent | `redeploy` | dedup via stable ID |
| `disk_pressure` (high) | stability-agent | `disk_cleanup` | dedup via stable ID |
| `ha_websocket_dead` | ha-diag-agent | `container_restart` (homeassistant) | 30 min after completion |
| `ha_websocket_recovered` | ha-diag-agent | cancels matching restart | — |
| `ha_integration_failed` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_entity_unavailable_long` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_automation_failing` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_update_available` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_recorder_lag` | ha-diag-agent | `alert_only` | 1 hour |
| `ha_system_health_degraded` | ha-diag-agent | `alert_only` | 1 hour |
HA events are routed directly from the events directory by the supervisor (not via world-state drift loop) to avoid conflicts with stability-agent's independent container health tracking. HA events are suppressed if `homeassistant` had a `containers_not_running` incident within the last 5 minutes (planned restart/update in progress).
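The routing table and the suppression rule above can be sketched as a lookup. This is a simplified illustration: `route`, `ha_suppressed`, and the `cancel_restart` label are hypothetical names, and the sketch ignores the severity check that limits `disk_cleanup` to high disk pressure.
```python
from datetime import datetime, timedelta

STABILITY_ROUTES = {
    "containers_not_running": "container_restart",
    "mqtt_unreachable": "container_restart",
    "disk_pressure": "disk_cleanup",  # real routing applies only to high severity
}
HA_ALERT_ONLY = {
    "ha_integration_failed", "ha_entity_unavailable_long", "ha_automation_failing",
    "ha_update_available", "ha_recorder_lag", "ha_system_health_degraded",
}


def route(event_type: str):
    """Return (action, cooldown); cooldown None means dedup via stable ID."""
    if event_type == "ha_websocket_dead":
        return ("container_restart", timedelta(minutes=30))
    if event_type == "ha_websocket_recovered":
        return ("cancel_restart", None)  # hypothetical label for the cancel step
    if event_type in HA_ALERT_ONLY:
        return ("alert_only", timedelta(hours=1))
    # service_unhealthy and any other stability-agent event fall back to redeploy.
    return (STABILITY_ROUTES.get(event_type, "redeploy"), None)


def ha_suppressed(now: datetime, last_containers_not_running) -> bool:
    """HA events are dropped within 5 minutes of a homeassistant restart incident."""
    if last_containers_not_running is None:
        return False
    return now - last_containers_not_running < timedelta(minutes=5)
```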
## Discovery Entry Points for Agents
When exploring the system, use these files in order:
1. `inventory/topology.yaml` — node list, roles, mesh type
2. `hosts/<node>/capabilities.yaml` — hardware and software constraints
3. `hosts/<node>/services.yaml` — desired services and exposure classes for that host
4. `services/<service>/service.yaml` — operational contract for a service
## VPS-Specific Rules
VPS has **4 GiB RAM, no swap**. Every repo-managed service must declare memory limits in its `hosts/vps/runtime/<service>/docker-compose.override.yml`.
### Memory limit convention
Use top-level Compose properties (not `deploy.resources.limits`, which requires Swarm mode):
```yaml
services:
  myservice:
    mem_limit: 256m       # cgroup ceiling; on breach the container is OOM-killed and Docker restarts it per its restart policy
    oom_score_adj: -900   # makes the host kernel OOM killer very unlikely to pick this container
```
Rules:
- **Control-plane containers** (executor, observer, supervisor, operator-ui), **node-agent**, **stability-agent**: always set `oom_score_adj: -900` — these must never be a system-level OOM victim.
- `mem_limit` still applies even with `oom_score_adj: -900`; the cgroup OOM killer is independent of the host OOM killer and kills the container when the limit is exceeded, after which Docker restarts it according to the container's restart policy.
- Budget: OS+Docker reserves ~800 MiB; sum of all `mem_limit` values must stay ≤ 3200 MiB (3.1 GiB).
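A budget check along these lines could be run before adding a service. It is a sketch under assumptions: `to_mib` and `within_budget` are hypothetical helpers, the accepted suffixes (`b`, `k`, `m`, `g`, optionally followed by `b`) mirror Compose byte-value notation, and the per-service limits are supplied by hand rather than read from the override files.
```python
import re

BUDGET_MIB = 3200  # ceiling for the sum of all mem_limit values on VPS

_FACTORS = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}


def to_mib(limit: str) -> float:
    """Convert a Compose byte value such as '256m' or '1gb' to MiB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmg]?)b?", limit.strip().lower())
    if not match:
        raise ValueError(f"unrecognised mem_limit: {limit!r}")
    number, unit = match.groups()
    return float(number) * _FACTORS[unit]


def within_budget(limits: dict, budget_mib: float = BUDGET_MIB) -> bool:
    """True when the summed limits fit under the budget."""
    return sum(to_mib(v) for v in limits.values()) <= budget_mib
```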
### Repo-managed services on VPS
All VPS services are now GitOps-managed. Service definitions live in `services/<name>/docker-compose.yml`; host-specific overrides (mem_limit, env) live in `hosts/vps/runtime/<name>/docker-compose.override.yml`.
| Service | Compose stack | Data path |
|---|---|---|
| npm | `services/npm/` | `/home/dockeruser/docker/npm/{data,letsencrypt}` (bind mount) |
| outline | `services/outline/` | Docker named volumes: `outline_outline_storage`, `outline_postgres_data`, `outline_redis_data` |
| joplin | `services/joplin/` | Docker named volume: `joplin_postgres_data` |
| ai-cluster | `services/ai-cluster/` | Mosquitto config bind: `/home/dockeruser/docker/ai-cluster/mosquitto/` |
**Data migration rule**: data paths stay in place at cutover. Never move volumes or bind-mount sources without a dedicated migration plan.
**Cutover checklist** (before running `docker compose up` for any migrated service):
1. `git pull` on VPS
2. Populate `/opt/homelab/config/<service>/.env` from the `env.example` template
3. For ai-cluster: copy `/home/dockeruser/docker/ai-cluster/.env` to `/opt/homelab/config/ai-cluster/.env`
4. For mosquitto: config stays at old bind path until explicitly migrated
5. Verify named volumes exist: `docker volume ls | grep <project>`
**ai-cluster architectural note**: compute workloads (codex-worker, planner-worker) belong on SOLARIA (GPU/compute node), not the 4 GB ingress VPS. Migrate when feasible; for now, hard mem_limits contain the blast radius.
## CHELSTY-Specific Rules
- Zigbee coordinator is **SLZB-06U** over TCP (`192.168.1.105:6638`, `ezsp` adapter). Never use `/dev/ttyUSB0`.
- CHELSTY nodes run **docker-compose v1** (1.29.2) — use `docker-compose` (hyphenated), not `docker compose`.
- Critical backup sets: HA config+data, Zigbee2MQTT config+db+network key, Mosquitto config+persistence, SLZB-06U coordinator state.
## Runtime Path Conventions
`/opt/homelab/` layout on each node:
- `data/<service>/` — persistent volumes
- `config/<service>/` — secrets and host-local overrides (not in Git)
- `logs/<service>/` — service logs
- `state/` — deployment stage markers, agent heartbeats
- `events/` — append-only event store
- `world/` — Observer output (synthesized state)
- `actions/` — pending / approved / running / completed / failed
## Definition of Done (services)
Before any new or changed service is considered ready:
1. **docker build + smoke run** — build the image locally and run it for a few seconds; confirm the process starts its main loop without crashing. This catches packaging/import errors (e.g. `ModuleNotFoundError`) before they reach a node.
2. **pytest** — run the service's test suite. If no tests exist yet, add a minimal one (at minimum: import passes, core logic has at least one case). Tests live in `services/<service>/tests/`.
3. **Never commit or deploy code that has never been run.** If a smoke run or test fails, fix it first.
## Naming Conventions
- Hosts: ALL CAPS (`SATURN`, `PIHA`)
- Services: kebab-case (`stability-agent`, `zigbee2mqtt`)
- Container names must match service names
- Always `restart: unless-stopped` unless `service.yaml` says otherwise
## Multi-agent worktree mode
`~/homelab-codex-ws` (main checkout) is **deploy-only** and belongs to the human operator.
Parallel agent tasks run in isolated git worktrees created by `scripts/dev/agent.sh new <name>`.
If `.agent-task` exists in your current working directory, you are in a task worktree.
**You must immediately read `.agent-task` and load `.claude/skills/worktree-aware/SKILL.md`
before taking any action.** That skill defines all branch-hygiene rules for task worktrees.
Worktree lifecycle commands: `agent.sh new | list | merge | clean`.
Agents never invoke these — only the human does.


@ -13,6 +13,22 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
| **PIHA** | Infra Node | Core infrastructure services, automation, and monitoring. |
| **VPS** | Edge Node | Public ingress, reverse proxy, and edge services. |
## Agent System
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
| Agent | Node | Role |
|-------|------|------|
| **stability-agent** | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
| **node-agent** | all nodes | Publishes container health events to Redis pub/sub |
| **observer** | VPS | Synthesizes world state from events into `/opt/homelab/world/*.json` |
| **supervisor** | VPS | Detects drift between desired and actual state; writes `pending` actions |
| **planner-agent** | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
| **executor** | VPS | Executes actions only after operator approval |
| **operator-ui** + **telegram-bot** | VPS / PIHA | Operator reviews and approves/rejects pending actions |
Action approval flow: `pending/` → operator approves → `approved/` → executor runs.
## Repository Structure
- `docs/`: [Infrastructure Standards](docs/standards.md) and [Deployment Conventions](docs/deployment.md).
@ -29,10 +45,13 @@ The homelab consists of several nodes connected via a Tailscale internal mesh.
## Documentation Index
- [Infrastructure Standards](docs/standards.md)
- [Agent Operating Procedures](docs/agents.md) (For AI/Non-Human Agents)
- [Deployment Conventions](docs/deployment.md)
- [Hardware](docs/hardware.md)
- [Networking](docs/networking.md)
- [Services](docs/services.md)
- [Node Capabilities](docs/capabilities.md)
- [Action Model](services/agent-system/action-model.md)
---
*Note: This repository documents the state of the homelab. Runtime state lives outside the repository in `/opt/homelab`.*


@ -0,0 +1,31 @@
{
"metadata": {
"format": "zigpy/open-coordinator-backup",
"version": 1,
"source": "zigbee-herdsman@10.0.7",
"internal": {
"date": "2026-05-14T14:48:35.098Z",
"znpVersion": 1
}
},
"stack_specific": {
"zstack": {
"tclk_seed": "32d69cbe3f0e15471e5d43f9401e485a"
}
},
"coordinator_ieee": "00124b00257bf416",
"pan_id": "46bc",
"extended_pan_id": "087730b5f614ea4a",
"nwk_update_id": 0,
"security_level": 5,
"channel": 11,
"channel_mask": [
11
],
"network_key": {
"key": "049909949a950d91522cf10cc369a724",
"sequence_number": 0,
"frame_counter": 0
},
"devices": []
}

docs/agents.md Normal file

@ -0,0 +1,49 @@
# Agent Operating Procedures
This document defines the operating procedures, constraints, and interaction protocols for non-human agents (AI agents, autonomous scripts) within the Homelab Codex ecosystem.
## 1. Core Principles for Agents
1. **Read-Only by Default**: Agents should assume read-only access to the `/opt/homelab` runtime unless explicitly executing an approved action.
2. **Git as Authority**: The repository on **SATURN** is the source of truth. Agents must not modify the runtime state on nodes directly without corresponding (or pending) Git state, unless it's an emergency mitigation.
3. **Human-in-the-Loop (HIL)**: All destructive or structural changes (restarts, deployments, config changes) must follow the [Action Approval Model](../services/agent-system/action-model.md).
4. **Idempotency**: All scripts and actions proposed or executed by agents MUST be idempotent.
5. **Context-Awareness**: Agents MUST read the `README.md` and `docs/agents.md` at the start of every session to align with current infrastructure standards.
## 2. Agent Roles
| Role | Responsibility | Scope |
|------|----------------|-------|
| **Observer** | Monitors health, logs, and events. | Read-only access to `/opt/homelab/events` and `logs`. |
| **Stability Agent** | Local node watchdog, event emitter. | Local node runtime, `service.yaml` healthchecks. |
| **Orchestrator** | High-level planning, workload placement. | Repository-wide, multi-node topology. |
| **Materializer** | Translates high-level intent into Docker/System state. | Execution of `approved` actions. |
## 3. Discovery Protocol
Agents must use the following entry points to understand the system:
1. **Topology**: `inventory/topology.yaml` for node list and roles.
2. **Capabilities**: `hosts/<node>/capabilities.yaml` to understand hardware/software constraints.
3. **Service Contract**: `services/<service>/service.yaml` to understand how to check health and manage a service.
4. **Operational State**: `/opt/homelab/state/` on local nodes for real-time status.
## 4. Interaction with Humans
Agents communicate with the operator via the `agent-system/telegram-bot`.
- **Alerting**: Agents emit events to the event system. Critical events are forwarded to Telegram.
- **Proposals**: When an agent identifies a need for change (e.g., "Service X is failing, suggest restart"), it creates a `pending` action in `/opt/homelab/actions/pending/`.
- **Approval**: Agents must wait for the action status to transition to `approved` before execution.
## 5. Decision Logic (Reasoning)
When making decisions, agents MUST prioritize:
1. **Safety**: Do not violate power constraints (see `capabilities.yaml`).
2. **Stability**: Prefer keeping services on their `owner_node` unless it's down.
3. **Connectivity**: On intermittent nodes (CHELSTY), avoid actions requiring heavy WAN traffic during low-signal periods.
## 6. Access Control for Agents
- **Filesystem**: Agents should run as the `homelab` user or equivalent with restricted sudo access to `docker compose`.
- **Secrets**: Agents MUST NOT attempt to read `.env` files unless specifically tasked with credential rotation. They should treat secrets as opaque handles.


@ -83,3 +83,10 @@ Future autonomous agents will use this metadata to:
2. **Generate Plans:** Create step-by-step deployment or migration plans based on hardware compatibility.
3. **Validate Topology:** Ensure that a proposed multi-node setup doesn't violate networking or operational constraints (e.g., don't put a DB on an intermittent node).
4. **Propose Failover:** Automatically suggest the best alternative node during an outage.
## Agent Reasoning Logic
When an agent parses `capabilities.yaml`, it should apply these heuristics:
- **Intermittent Connectivity**: If `operational.connectivity == "intermittent"`, do not schedule high-bandwidth syncs or critical cloud-dependent services.
- **Power Constraints**: If `operational.power_constraint == "low-power"`, avoid heavy LLM inference or continuous high-CPU tasks.
- **Availability Target**: If `availability_target == "high"`, this node is a candidate for hosting control-plane failovers.
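These heuristics map naturally onto a small pure function. The sketch below assumes `capabilities.yaml` has already been parsed into a dict (for example with a YAML loader) and that `availability_target` sits at the top level, as written above; the function name `placement_hints` and the returned keys are hypothetical.

```python
from typing import Any, Mapping


def placement_hints(capabilities: Mapping[str, Any]) -> dict[str, bool]:
    """Translate parsed capabilities into scheduling hints (illustrative only)."""
    operational = capabilities.get("operational", {})
    intermittent = operational.get("connectivity") == "intermittent"
    low_power = operational.get("power_constraint") == "low-power"
    high_availability = capabilities.get("availability_target") == "high"
    return {
        # Intermittent links rule out bandwidth-heavy or cloud-dependent work.
        "allow_high_bandwidth_sync": not intermittent,
        "allow_cloud_dependent_services": not intermittent,
        # Low-power nodes should not run heavy inference or sustained CPU load.
        "allow_heavy_inference": not low_power,
        # High-availability nodes may host control-plane failovers.
        "control_plane_failover_candidate": high_availability,
    }
```

Returning explicit booleans rather than raising keeps the function usable by a planner that wants to rank candidate nodes instead of rejecting them outright.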


@ -1,60 +1,154 @@
# CHELSTY Runtime
This document describes the runtime environment and deployment flow for CHELSTY, an offline-capable home automation edge node split across two VMs.
| Node | Role | Services |
|------|------|----------|
| `chelsty-infra` | LTE edge hypervisor | Mosquitto, Zigbee2MQTT, stability-agent, node-agent |
| `chelsty-ha` | Home Assistant VM | homeassistant (no node-agent — see below) |
Both nodes share an LTE uplink and must function fully offline (Zigbee, MQTT, HA automations) without any connectivity to SATURN, VPS, or Forgejo.
## Runtime Layout
```
/opt/homelab/
├── config/              # Service-specific configs and secrets (not in Git)
│   ├── mosquitto/
│   └── zigbee2mqtt/
├── data/                # Persistent service data
│   ├── mosquitto/       # Persistence DB, password file
│   └── zigbee2mqtt/
│       └── data/        # z2m config, coordinator backup, network key
└── logs/
```
## SLZB-06U Integration
CHELSTY uses a SMLIGHT SLZB-06U Zigbee coordinator connected over Ethernet/TCP.
- **Coordinator IP**: `192.168.1.105`
- **Port**: `6638`
- **Adapter**: `ezsp` (deprecated; migration to `ember` is recommended and requires only setting `adapter: ember` in `configuration.yaml`)
- **Zigbee2MQTT config key**: `serial.port: tcp://192.168.1.105:6638`
⚠️ Never use `/dev/ttyUSB0`: the coordinator is always TCP-only on this site.
## Networking Constraints
### Mosquitto: `network_mode: host`
Mosquitto runs with `network_mode: host`, so the broker listens directly on the host's interfaces. Host processes and host-networked containers reach it at `localhost:1883`; bridge-networked containers reach it through the host gateway. **Do not change this.**
### Zigbee2MQTT: bridge network + extra_hosts
Zigbee2MQTT runs in a bridge-networked container (needed for port mapping compatibility with docker-compose v1). To reach the host-networked Mosquitto:
```yaml
# hosts/chelsty-infra/runtime/zigbee2mqtt/docker-compose.override.yml
services:
zigbee2mqtt:
extra_hosts:
- "mosquitto:host-gateway"
```
This maps the `mosquitto` hostname inside the z2m container to the Docker host gateway IP, so `mqtt://mosquitto:1883` reaches the host-networked Mosquitto process.
**Why not `network_mode: host` for z2m?**
chelsty-infra runs docker-compose v1 (1.29.2). In v1, `network_mode: host` cannot coexist with `ports:` declared in the base `docker-compose.yml` — raises `InvalidArgument`. The `extra_hosts` approach avoids this.
## Zigbee2MQTT Config Location
The `configuration.yaml` **must be writable** — z2m migrates and rewrites it on startup. It lives in the data directory:
```
/opt/homelab/data/zigbee2mqtt/data/configuration.yaml
```
This path is mounted read-write by the base `docker-compose.yml`:
```yaml
volumes:
- /opt/homelab/data/zigbee2mqtt/data:/app/data
```
Do **not** mount `configuration.yaml` as a separate `:ro` volume — z2m will fail with `EROFS`.
### Minimal configuration.yaml
```yaml
homeassistant: true
permit_join: false
mqtt:
base_topic: zigbee2mqtt
server: mqtt://mosquitto:1883
serial:
port: tcp://192.168.1.105:6638
adapter: ezsp
frontend:
port: 8080
advanced:
log_level: info
```
## chelsty-ha — No node-agent
`chelsty-ha` does not have a node-agent deployed. Home Assistant is monitored indirectly: if MQTT goes silent on `chelsty-infra`, HA is likely down.
In `hosts/chelsty-ha/services.yaml`:
```yaml
services:
homeassistant:
monitor: false # No node-agent; suppresses supervisor action generation
```
Remove `monitor: false` once node-agent is bootstrapped on this VM.
## Deployment Flow
### Initial Bootstrap
```bash
./scripts/bootstrap/chelsty-runtime.sh
```
### Deploy services
```bash
./scripts/deploy/deploy-node.sh chelsty-infra
./scripts/deploy/deploy-node.sh chelsty-ha
```
### Manual (SSH): chelsty-infra uses docker-compose v1
```bash
ssh oskar@100.122.201.22
cd ~/homelab-codex-ws/services/<service>
docker-compose -f docker-compose.yml \
  -f ../../hosts/chelsty-infra/runtime/<service>/docker-compose.override.yml \
  up -d --build --force-recreate
```
> **Note:** `docker compose` (v2) is **not** available on chelsty-infra; always use `docker-compose` (hyphenated, v1 1.29.2).
## Recovery Procedures
### Mosquitto stopped
```bash
ssh oskar@100.122.201.22 "docker start mosquitto"
# Ensure the restart policy is correct (also run on chelsty-infra):
ssh oskar@100.122.201.22 "docker update --restart unless-stopped mosquitto"
```
### Zigbee2MQTT won't start
1. Check logs: `docker logs zigbee2mqtt --tail 50`
2. Verify SLZB-06U reachable from host: `nc -zv 192.168.1.105 6638`
3. Verify config is not empty: `cat /opt/homelab/data/zigbee2mqtt/data/configuration.yaml`
4. If config missing, recreate from the minimal template above
### SLZB-06U unreachable
`192.168.1.105:6638` EHOSTUNREACH means the coordinator is offline or the LAN is down. Zigbee2MQTT will keep retrying — no restart needed once the coordinator returns.
## Critical Backup Sets
| Data | Path |
|------|------|
| HA config + DB | `/opt/homelab/data/homeassistant/` on chelsty-ha |
| z2m config + coordinator backup + network key | `/opt/homelab/data/zigbee2mqtt/data/` |
| Mosquitto persistence + password file | `/opt/homelab/data/mosquitto/` |
| SLZB-06U coordinator state | Backup via SLZB-06U web UI at `192.168.1.105` |
> ⚠️ The Zigbee network key is in `configuration.yaml` or `coordinator_backup.json` — losing it requires re-pairing all devices.


@ -0,0 +1,42 @@
### CHELSTY Stability Agent
The stability-agent on CHELSTY provides local observability and health monitoring for the node's services and infrastructure.
#### Purpose
It acts as a filesystem-first watchdog that detects anomalies in the local runtime environment without taking autonomous destructive actions (like restarts). It serves as the primary data source for node-level stability metrics.
#### Monitoring Scope
* **Docker Containers**: Monitors all local containers. If a container is not in the `running` state, a `containers_not_running` event is generated.
* **Disk Usage**: Monitors the root filesystem. Generates `disk_usage_high` events if usage exceeds the configured threshold.
* **Connectivity**:
* Checks if the Tailscale socket or interface is available.
* Checks reachability of the local Mosquitto MQTT broker.
* **Zigbee2MQTT**: Specifically tracks the presence and status of the Zigbee2MQTT service.
#### Storage and Integration
* **Heartbeat**: Updated every cycle at `/opt/homelab/state/stability-agent.heartbeat`.
* **State Summary**: A JSON summary of all latest checks at `/opt/homelab/state/stability-agent.json`.
* **Events**: Append-only JSON lines at `/opt/homelab/events/YYYY-MM-DD/chelsty-infra/events.jsonl`.
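The append-only event layout can be illustrated with a short writer. This is a sketch, not the stability-agent's implementation: the function name `append_event`, the use of the UTC date for the directory name, and the `timestamp`/`node` record fields are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_event(
    events_root: Path,
    node: str,
    event: dict,
    now: Optional[datetime] = None,
) -> Path:
    """Append one JSON line to <events_root>/YYYY-MM-DD/<node>/events.jsonl."""
    now = now or datetime.now(timezone.utc)
    target = events_root / now.strftime("%Y-%m-%d") / node / "events.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    record = {"timestamp": now.timestamp(), "node": node, **event}
    # Mode "a" only ever adds to the end of the file, preserving earlier events.
    with target.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return target
```

A call such as `append_event(Path("/opt/homelab/events"), "chelsty-infra", {"type": "disk_usage_high"})` would add one line to the current day's file; the `type` key is illustrative.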
#### Deployment
The service is deployed via Docker Compose on CHELSTY.
```bash
cd services/stability-agent
docker-compose up -d  # CHELSTY runs docker-compose v1 (hyphenated)
```
#### Configuration
Configuration is managed via environment variables in `docker-compose.override.yml` on the host.
| Variable | Description | Default |
|----------|-------------|---------|
| `STABILITY_CHECK_INTERVAL` | Seconds between checks | `60` |
| `DISK_THRESHOLD_PCT` | Disk usage alert threshold | `90` |
| `MQTT_HOST` | MQTT broker hostname | `mosquitto` |
| `MQTT_PORT` | MQTT broker port | `1883` |


@ -7,57 +7,92 @@ The Observer Runtime is a lightweight agent responsible for synthesizing the ope
The observer follows a filesystem-first approach, consuming append-only events and generating a normalized world model. It is designed to be idempotent, resumable, and resilient to intermittent node connectivity.
### Inputs
- `/opt/homelab/events/`: Normalized JSON events (one `.json` file per event, organized by date and node).
- `/opt/homelab/state/observer_checkpoint.json`: Per-node checkpoint dict (see below).
- Repository Inventory: `inventory/topology.yaml` and `hosts/*/services.yaml`.
### World Model Output
Generated under `/opt/homelab/world/`:
- `nodes.json`: Current node availability, roles, disk/memory pressure, last seen timestamps. Dict keyed by node name.
- `services.json`: Service health status and links to active incidents. Dict keyed by `"node/service"`.
- `deployments.json`: Tracking of active and historical deployment runs by `correlation_id`.
- `incidents.json`: Correlated operational issues, including repeat failures and resolution status.
- `runtime-summary.json`: High-level overview for dashboards and planner agents.
## Checkpoint Format
The observer tracks per-node progress to avoid silently skipping event directories:
```json
{
  "node_checkpoints": {
    "vps": "/opt/homelab/events/2026-05-27/vps/evt-vps-1234.json",
    "piha": "/opt/homelab/events/2026-05-27/piha/evt-piha-5678.json",
    "chelsty-infra": "/opt/homelab/events/2026-05-27/chelsty-infra/evt-chelsty-infra-9012.json"
  }
}
```
A single global checkpoint (`last_processed_file`) was replaced with this per-node dict because the old approach silently skipped any node directory that sorts alphabetically before the last-seen node (e.g. `piha/` would be skipped when the checkpoint pointed to `vps/`).
**Reset:** Delete `/opt/homelab/state/observer_checkpoint.json`. The observer will reprocess all events and rebuild world state from scratch.
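The difference between a global and a per-node checkpoint is easiest to see in code. The sketch below is not the observer's implementation: `unprocessed_events` is a hypothetical name, and it assumes that lexicographic path order within one node matches event order.

```python
from pathlib import Path


def unprocessed_events(events_root: Path, node_checkpoints: dict[str, str]) -> list[Path]:
    """Return event files newer than each node's own checkpoint, in path order."""
    pending: list[Path] = []
    # Layout assumed from the checkpoint example: <root>/<date>/<node>/<event>.json
    for event_file in sorted(events_root.glob("*/*/*.json")):
        node = event_file.parent.name
        checkpoint = node_checkpoints.get(node)
        # Each file is compared only against its own node's checkpoint, so a
        # checkpoint under vps/ can never hide unread files under piha/.
        if checkpoint is None or str(event_file) > checkpoint:
            pending.append(event_file)
    return pending
```

With a single global checkpoint pointing into `vps/`, every `piha/` path for the same date would compare as "older" and be skipped; keying the comparison by node removes that failure mode.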
## Event Types
### Negative events (create/escalate incidents)
- `service_unhealthy`, `healthcheck_failed` — open or increment an active incident
- `deployment_failed` — record failure in deployments.json
### Positive events (resolve state)
- `service_healthy` — marks service status as `healthy` **and** resolves any active incident for that service
- `service_recovered` — alias, same effect
- `deployment_completed` — marks deployment as completed
### Node events
- `node_online`, `node_offline` — update node status in nodes.json
- `disk_pressure_*` — set `disk_pressure` field on the node record
## Incident Lifecycle
1. **Detection**: A `service_unhealthy` or `healthcheck_failed` event creates or increments an active incident.
2. **Correlation**: Multiple failure events for the same `node/service` are collapsed into one incident, incrementing `occurrence_count`.
3. **Resolution**: A `service_healthy` or `service_recovered` event resolves any active incident for that service, setting `status: resolved` and `resolved_at`.
4. **Expiry**: Resolved incidents older than 7 days are pruned from world state by `_prune_stale_world()`.
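The detection, correlation, and resolution steps can be sketched as a single state update. This is an illustration under assumptions: the event fields `type` and `timestamp` and the `active` status value are hypothetical, and the incident id format is inferred from the example incident in this document.

```python
from typing import Any

NEGATIVE = {"service_unhealthy", "healthcheck_failed"}
POSITIVE = {"service_healthy", "service_recovered"}


def apply_event(incidents: dict[str, dict[str, Any]], event: dict[str, Any]) -> None:
    """Update the incidents dict in place for one health event."""
    node, service = event["node"], event["service"]
    ts = float(event["timestamp"])
    active = next(
        (
            inc
            for inc in incidents.values()
            if inc["node"] == node and inc["service"] == service and inc["status"] == "active"
        ),
        None,
    )
    if event["type"] in NEGATIVE:
        if active is None:
            # Detection: open a new incident for this node/service.
            inc_id = f"inc-{int(ts)}-{node}-{service}"
            incidents[inc_id] = {
                "id": inc_id, "node": node, "service": service, "status": "active",
                "started_at": ts, "last_occurrence": ts, "occurrence_count": 1,
            }
        else:
            # Correlation: collapse repeats into the existing incident.
            active["last_occurrence"] = ts
            active["occurrence_count"] += 1
    elif event["type"] in POSITIVE and active is not None:
        # Resolution: close the active incident.
        active["status"] = "resolved"
        active["resolved_at"] = ts
```

Expiry of resolved incidents is left out here; it would be a separate pass that drops entries whose `resolved_at` is older than the retention window.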
### Example Incident JSON
```json
{
"inc-1715518800-vps-observer": {
"id": "inc-1715518800-vps-observer",
"node": "vps",
"service": "observer",
"status": "resolved",
"severity": "error",
"started_at": 1715518800.0,
"last_occurrence": 1715518860.0,
"occurrence_count": 2,
"trigger_type": "containers_not_running",
"resolved_at": 1715519100.0
}
}
```
## World State Pruning
`_prune_stale_world()` runs every reconcile cycle and removes:
1. **Stale nodes** — nodes not present in `inventory/topology.yaml` (e.g. ghost nodes created when `NODE_NAME` was unset and fell back to the container's 12-char hex ID).
2. **Services of stale nodes** — all `node/service` keys whose node was pruned.
3. **Ghost service keys** — service keys whose service-name portion matches the pattern `<12hexchars>_<name>` (Docker internal stale-state artifacts, created when node-agent used `c.name` instead of the compose label).
4. **Expired incidents** — resolved incidents older than 7 days.
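Rules 1 to 3 reduce to a filter over the service keys. The following is a sketch rather than the body of `_prune_stale_world()`: the function name `prune_services` and the exact regular expression are assumptions derived from the `<12hexchars>_<name>` pattern described above.

```python
import re

# Service names of the form <12 hex chars>_<name> are treated as ghost keys.
GHOST_SERVICE = re.compile(r"^[0-9a-f]{12}_.+$")


def prune_services(services: dict[str, dict], topology_nodes: set[str]) -> dict[str, dict]:
    """Drop services on unknown nodes and services with ghost-style names."""
    kept: dict[str, dict] = {}
    for key, record in services.items():
        node, _, service = key.partition("/")
        if node not in topology_nodes:
            continue  # stale node, so all of its services go too
        if GHOST_SERVICE.match(service):
            continue  # Docker stale-state artifact
        kept[key] = record
    return kept
```

Building a new dict instead of deleting in place avoids mutating the mapping while iterating over it.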
## Runtime Behavior
### Idempotency
The observer processes events in order. Deleting the checkpoint and restarting replays all events and produces the same world state.
### Deployment Tracking
Deployments are tracked via `correlation_id`. The observer synthesizes the start, end, and status of each deployment run from events.
### Topology Filtering
Events from nodes not listed in `inventory/topology.yaml` are discarded during pruning. This prevents transient bootstrap noise from polluting world state.


@ -0,0 +1,234 @@
# SESSION: Building planner-agent (LLM-based diagnostics)
**DATE:** 2026-05-27
**RESULT:** planner-agent is running on SOLARIA (`healthy`), with Ollama as primary and the cloud fallback ready to be enabled
---
## What was built
### `services/planner-agent/src/llm_router.py`
LLM routing module with a local-first fallback chain:
- **`LLMRouter`**: main routing class, built on litellm
- **`ModelConfig`**: configuration for a single model (name, timeout, api_base, extra_kwargs)
- **`ModelMetrics`**: counters per model × outcome (`success`/`fallback`/`error`); success_rate
- **`RouteResult`**: routing result with `content`, `model_used`, `attempts`, `latency_ms`
- **`AttemptRecord`**: record of a single attempt (model, outcome, reason, latency_ms)
- **`_extract_json_from_fence()`**: extracts JSON from ` ```json ``` ` blocks when the model does not reply with bare JSON
Default chain: `ollama/qwen2.5:7b` (8s) → `claude-haiku-4-5-20251001` (30s) → `claude-sonnet-4-6` (30s)
Metrics for every call are published to the Redis channel `llm_router_metrics`.
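The fenced-JSON fallback mentioned for `_extract_json_from_fence()` might look roughly like this. It is a sketch and not the module's code: the name `extract_json`, the regular expression, and the try-bare-JSON-first order are assumptions.

```python
import json
import re
from typing import Any

# Built from a repeated backtick so this example stays readable inside Markdown.
_TICKS = "`" * 3
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL)


def extract_json(text: str) -> Any:
    """Parse a model reply as JSON, falling back to the first fenced block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = _FENCE.search(text)
    if match is None:
        raise ValueError("no JSON found in model reply")
    # json.JSONDecodeError is a ValueError, so callers can catch one type.
    return json.loads(match.group(1))
```

A router could treat the raised `ValueError` as the "invalid JSON" rejection reason and move on to the next model in the chain.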
### `services/planner-agent/src/planner.py`
Main agent loop:
- **`PlannerAgent`**: async agent: Redis sub → LLM diagnosis → pending action file → event
- **`HealthEvent`**: normalized health event from Redis (node, service, event_type, severity, payload)
- **`ActionProposal`**: action proposal with full metadata; `.to_action_file()` → executor format
- **`CooldownTracker`**: 5-minute gate per `svc_key` (node/service); does NOT register a cooldown if the LLM chain failed
- **`parse_event()`**: normalizes two input formats (node-agent / control-plane)
- **`write_pending_action()`**: atomic write: `.tmp` → rename
- **`emit_event()`**: writes a `remediation_started` event to the filesystem (no imports from control-plane)
Pipeline:
```
Redis msg → parse_event() → benign skip → cooldown gate → _propose_action() (LLM)
→ write_pending_action() → emit_event("remediation_started")
```
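The `.tmp` → rename step in the pipeline above is what keeps the executor and operator UI from ever seeing a half-written action file. A minimal sketch of that pattern follows; it is not the body of `write_pending_action()`, and the name `atomic_write_pending`, the `<action_id>.json` file naming, and the `fsync` call are assumptions.

```python
import json
import os
from pathlib import Path


def atomic_write_pending(actions_root: Path, action_id: str, action: dict) -> Path:
    """Write an action under pending/ so readers never observe a partial file."""
    pending_dir = actions_root / "pending"
    pending_dir.mkdir(parents=True, exist_ok=True)
    final_path = pending_dir / f"{action_id}.json"
    tmp_path = pending_dir / f"{action_id}.json.tmp"
    with tmp_path.open("w", encoding="utf-8") as handle:
        json.dump(action, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk before renaming
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

Keeping the temporary file in the same directory as the target is what makes the rename atomic; a temp file on another filesystem would turn the rename into a copy.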
### Supporting files
| File | Description |
|------|------|
| `service.yaml` | Operational contract: owner_node=solaria, deps=redis+ollama, healthcheck=file |
| `docker-compose.yml` | env_file + extra_hosts:host-gateway + ANTHROPIC_API_KEY in environment |
| `Dockerfile` | python:3.11-slim, litellm, redis, jsonschema, structlog |
| `healthcheck.sh` | Checks the age of the heartbeat file (max 300s) |
| `requirements.txt` | litellm, redis, jsonschema, structlog |
| `tests/test_planner.py` | 49 unit tests |
| `tests/test_llm_router.py` | 34 unit tests |
---
## Key architectural decisions
### 1. HITL invariant (Human-in-the-loop)
The planner writes **only** to `actions/pending/`. The executor requires a file in `actions/approved/`.
The planner never executes an action on its own; this is a fundamental rule of the system.
Implementation: `write_pending_action()` writes to `pending/`, and no code path touches `approved/`.
### 2. Cooldown gate
Per `svc_key` (= `node/service`), 5 minutes by default. Goal: avoid flooding the operator with repeated
proposals for the same service.
**Key decision:** the cooldown is NOT registered if the entire LLM chain failed.
This lets the next event retry instead of being silently blocked
for 5 minutes even though no proposal was produced.
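The key decision above, recording the cooldown only after a proposal was actually produced, can be sketched as follows. This is not the real `CooldownTracker`: the class name `CooldownSketch`, the `handle_event` helper, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class CooldownSketch:
    """Per-service cooldown gate with a 300 s default window."""

    def __init__(self, window_s: float = 300.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock
        self._last: dict[str, float] = {}

    def is_cooling(self, svc_key: str) -> bool:
        last = self._last.get(svc_key)
        return last is not None and self._clock() - last < self._window_s

    def record(self, svc_key: str) -> None:
        self._last[svc_key] = self._clock()


def handle_event(
    svc_key: str,
    cooldown: CooldownSketch,
    propose: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Run the proposer unless the service is cooling down."""
    if cooldown.is_cooling(svc_key):
        return None
    proposal = propose(svc_key)  # None means the whole LLM chain failed
    if proposal is not None:
        cooldown.record(svc_key)  # only successful proposals start the cooldown
    return proposal
```

Because a failed chain leaves no cooldown entry, the very next event for the same `svc_key` gets another attempt.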
### 3. Fallback chain: local-first
Order: Ollama (local GPU) → Haiku → Sonnet.
Rationale:
- Ollama sends no data to external services; low latency for simple cases
- Haiku = fast, cheap cloud fallback
- Sonnet = last resort for hard cases
A model is rejected on: timeout, network error, refusal pattern, invalid JSON, schema error.
### 4. No imports from control-plane
`services/planner-agent/` is fully self-contained. It imports nothing from
`services/control-plane/`. Event emission is implemented locally (a copy of the logic in
`scripts/lib/events.py`).
Rationale: the planner must keep working even if control-plane is offline, and it has a separate
deployment cycle.
### 5. structlog with PrintLoggerFactory
We do not use `structlog.stdlib.add_logger_name` because `PrintLogger` has no `.name` attribute.
Instead, the processor chain is: `add_log_level` → `TimeStamper` → `StackInfoRenderer` →
`format_exc_info` → `JSONRenderer`.
### 6. NODE_NAME is read at call time, not import time
`_emit_event_sync` reads the module-level `NODE_NAME` on every call
(not as a default parameter). This makes it patchable in tests.
---
## Problems encountered and solutions
### Problem: `localhost` in a container does not reach the host
**Context:** Ollama runs on SOLARIA at `localhost:11434`. A Docker container
on the default bridge network cannot reach the host via `localhost`.
**Solution:**
1. Added `extra_hosts: - "host-gateway:host-gateway"` to docker-compose.yml
2. `.env` uses `OLLAMA_HOST=http://host-gateway:11434`
### Problem: `environment` vs `env_file`, duplicated variables
**Context:** The first version of docker-compose.yml had all variables hardcoded
in the `environment` section with fallback values (`${VAR:-default}`). As a result,
`.env` was optional rather than required.
**Solution:** All runtime variables were removed from `environment` and moved to `env_file`.
Only `ANTHROPIC_API_KEY` remains in `environment` (an optional secret that should not live in a file on disk).
### Problem: `structlog.stdlib.add_logger_name` crashes with PrintLogger
**Symptom:** `AttributeError: 'PrintLogger' object has no attribute 'name'`
**Solution:** Removed `add_logger_name` from the processor chain. It is not
compatible with `PrintLoggerFactory`.
### Problem: verify stage fails right after startup
**Symptom:** `deploy.sh` reports FAILED at verify because the heartbeat does not exist yet.
**Cause:** Race condition: the agent needs a few seconds to start
its loop and perform the first `touch()` of the heartbeat.
**Solution:** This is not a real failure. The Docker healthcheck has `start_period: 30s`.
The container shows `(healthy)` 30s after start.
### Problem: git pull with divergent branches on solaria
**Symptom:** Solaria had 2 local commits not present on Forgejo, plus manual changes in the working tree.
`git pull` failed with "Need to specify how to reconcile divergent branches."
**Solution:**
```bash
git checkout -- services/planner-agent/docker-compose.yml # discard manual changes
git fetch origin
git rebase origin/master # rebase local commits on top of master
```
---
## Deployment status on SOLARIA
```
Container: planner-agent Up ~30m (healthy)
Image: planner-agent-planner-agent
Node: solaria (100.100.231.104)
Heartbeat: /opt/homelab/state/planner-agent.heartbeat (age 0s)
Channels subscribed:
- health_events
- world_updates
LLM chain:
PRIMARY: ollama/qwen2.5-coder:14b @ http://host-gateway:11434
FALLBACK: claude-haiku-4-5-20251001 (disabled: no ANTHROPIC_API_KEY)
FALLBACK: claude-sonnet-4-6 (disabled: no ANTHROPIC_API_KEY)
Redis: redis://100.108.208.3:6379 ✓ connected
```
---
## Left for later
### 1. ANTHROPIC_API_KEY: cloud fallback disabled
Haiku and Sonnet are configured in the chain but have no API key.
When Ollama cannot cope (complex case / timeout), the chain fails with no fallback.
To enable:
```bash
ssh oskar@100.100.231.104
echo "ANTHROPIC_API_KEY=sk-ant-..." >> /opt/homelab/config/planner-agent/.env
docker compose -f ~/homelab-codex-ws/services/planner-agent/docker-compose.yml up -d
```
### 2. End-to-end test with a real event
The planner is connected to Redis and listening, but no event has yet
gone through the full path (LLM call → pending action → operator UI).
Test:
```bash
redis-cli -h 100.108.208.3 PUBLISH health_events '{
"type": "service_unhealthy",
"node": "piha",
"service": "mosquitto",
"severity": "error",
"payload": {"reason": "container exited"},
"timestamp": "2026-05-27T20:00:00Z"
}'
# Watch: docker logs planner-agent -f
# Check: ls /opt/homelab/actions/pending/
```
### 3. Solaria local commits
Solaria has 2 local commits (`feat: add ECC skills`, `fix: remove duplicate CLAUDE.md sections`)
that are not on Forgejo. They were rebased on top of master but not pushed.
They should be pushed, or reviewed and possibly squashed.
### 4. Integration with operator UI / Telegram
Proposals in `actions/pending/` do not yet have a notification channel to the operator.
The Telegram bot should send a notification when a new file appears in `pending/`.
---
## Commits from this session
```
ff6fda1 planner-agent: use env_file, keep only ANTHROPIC_API_KEY in environment
ca37fca Add planner-agent: LLM-powered remediation planner
(llm_router.py, planner.py, tests, service.yaml, docker-compose.yml,
healthcheck.sh, Dockerfile)
```

docs/sessions/2026-05-27.md

@ -0,0 +1,103 @@
# SESSION: Stabilizing the homelab multi-agent system
**DATE:** 2026-05-27
**RESULT:** System NOMINAL (97/97 services, 0 errors)
---
## PROBLEMS FOUND
- stability-agent did not generate remediation actions: only redeploy, no container_restart
- mosquitto on chelsty-infra went down and nothing restarted it (restart policy was `no`)
- zigbee2mqtt had never been deployed on chelsty-infra
- node-agent was an empty skeleton: it did not emit `service_healthy`, so `services.json` was always empty
- ghost services: node-agent used `c.name` (which can return `<12hex>_real-name`) instead of the `com.docker.compose.service` label
- materializer on piha read from its local Redis instead of the control-plane API; Redis held 80 stale entries with ghost keys, so "Copy for AI" returned old data
- observer used a single global checkpoint instead of per-node ones; it silently skipped event directories sorting before the current checkpoint
- supervisor did not cancel resolved actions, so the pending queue grew without bound
- the `service_healthy` event did not close active incidents
- NODE_ALIAS_MAP was not configured: node names mismatched between events and topology
- chelsty-ha was wrongly in monitoring scope: it has no node-agent
---
## FIXES SHIPPED (commits in master)
```
7277bdc Fix Copy for AI: materializer fetches from control-plane API instead of Redis
b40b832 Fix ghost service keys from hash-prefixed Docker container names
28e9534 observer: service_healthy resolves active incidents
46ae92b supervisor: also cancel pending actions for services removed from desired state
410bfe7 zigbee2mqtt: config goes in data dir (writable), not separate ro mount
b3912fe zigbee2mqtt: use extra_hosts host-gateway instead of network_mode: host
61e07f4 zigbee2mqtt override: clear ports list for docker-compose v1 host network compat
51002d4 Fix pending actions: node_exporter, zigbee2mqtt, chelsty-ha monitoring
fb7828b supervisor: auto-cancel pending actions when drift is resolved
2f19657 fix(node-agent): unique event IDs per service to prevent same-second overwrites
267742c vps/node-agent: add network_mode: host for control-plane health probe
4e8968f Fix service health tracking: emit service_healthy, control-plane endpoint, checkpoint migration
f4a8db9 fix(observer): per-node-directory checkpoints replace single global checkpoint
a5a3e22 fix(node-agent): skip SSH config file in rsync to avoid UID ownership errors
2349de5 fix(node-agent): correct VPS_EVENTS_HOST to actual VPS Tailscale IP
65bac4e fix(node-agent): mount host SSH key into container for event shipping
96bf326 fix(observer+operator-ui): fix stale world state, dict→list API, event time filter
ae33cce feat(node-agent): add runtime overrides for piha, solaria, chelsty-infra
c5c080b feat(vps): add node-agent runtime override with NODE_NAME=vps
01b7758 feat(node-agent): implement health monitor and safe cleanup policy
```
### Details of key fixes
**fix(observer): per-node checkpoints**
A single global checkpoint `last_processed_file` silently skipped event directories that sort alphabetically before the last processed node (e.g. piha/ < vps/). It was replaced with a per-node dictionary `{"node_checkpoints": {"piha": "...", "vps": "..."}}`.
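The failure mode and the fix can be reproduced in a few lines. This is a sketch under assumptions: the file names and the `unprocessed()` helper are hypothetical, and only the `node_checkpoints` shape is taken from the text.

```python
from typing import Dict, List


def unprocessed(files_by_node: Dict[str, List[str]],
                node_checkpoints: Dict[str, str]) -> List[str]:
    """Return event paths newer than each node's own checkpoint."""
    todo: List[str] = []
    for node, names in sorted(files_by_node.items()):
        last = node_checkpoints.get(node, "")
        todo.extend(f"{node}/{name}" for name in sorted(names) if name > last)
    return todo


# Hypothetical event files: vps was fully processed, then piha wrote a new event.
files = {"piha": ["0003.json"], "vps": ["0001.json", "0002.json"]}

# Old behaviour: one global path checkpoint compared as a string.
global_checkpoint = "vps/0002.json"
seen_by_global = [
    f"{node}/{name}"
    for node, names in files.items()
    for name in names
    if f"{node}/{name}" > global_checkpoint
]

# New behaviour: each node directory is compared against its own checkpoint.
seen_per_node = unprocessed(files, {"vps": "0002.json"})
```

With the global checkpoint, `piha/0003.json` sorts before `vps/0002.json` and is never picked up; the per-node lookup finds it.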
**fix(observer): ghost key pruning**
`_prune_stale_world()` now removes entries from services.json whose service key matches the pattern `<12hexchars>_<name>`, which are artifacts of Docker internal state tracking.
**fix(node-agent): canonical container name**
`check_containers()` now uses the `com.docker.compose.service` label as the canonical name. Fallback: strip the hash prefix from `c.name`. Containers in the `created` state are skipped (Docker stale-state artifacts).
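Both fixes rely on recognising the `<12hexchars>_<name>` shape. A minimal sketch, assuming lowercase hex and hypothetical helper names (`canonical_name`, `is_ghost_key`); the real functions may differ.

```python
import re
from typing import Mapping

# Twelve hex characters, an underscore, then the real name (lowercase hex assumed).
GHOST_PREFIX = re.compile(r"^[0-9a-f]{12}_(?P<name>.+)$")


def canonical_name(container_name: str, labels: Mapping[str, str]) -> str:
    """Prefer the compose service label; otherwise strip a hash prefix if present."""
    label = labels.get("com.docker.compose.service")
    if label:
        return label
    match = GHOST_PREFIX.match(container_name)
    return match.group("name") if match else container_name


def is_ghost_key(service_key: str) -> bool:
    """True for keys that look like '<12hex>_<name>' and should be pruned."""
    return GHOST_PREFIX.match(service_key) is not None


from_label = canonical_name("0123456789ab_z2m", {"com.docker.compose.service": "zigbee2mqtt"})
from_strip = canonical_name("0123456789ab_mosquitto", {})
unchanged = canonical_name("node-agent", {})
```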
**fix(node-agent): service_healthy emission**
Node-agent now emits `service_healthy` for every running managed container each cycle. Without this, `services.json` was always empty and the supervisor generated a flood of "missing service" redeploys.
**fix(supervisor): auto-cancel resolved actions**
`_cancel_resolved_pending_actions()` moves pending actions to `cancelled/` when:
- the service has become healthy (`drift_resolved_auto`)
- the service has been removed from desired state (`service_removed_from_desired_state`)
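The two cancel conditions above can be sketched as a single pass over `pending/`. The JSON fields (`node`, `service`, `cancel_reason`) and the function name `cancel_resolved` are assumptions for illustration; only the directory names and the two reason strings come from the text.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Set


def cancel_resolved(actions_root: Path, healthy: Set[str],
                    desired: Set[str]) -> Dict[str, str]:
    """Move resolved pending actions to cancelled/ and report why."""
    pending = actions_root / "pending"
    cancelled = actions_root / "cancelled"
    cancelled.mkdir(parents=True, exist_ok=True)
    moved: Dict[str, str] = {}
    for path in sorted(pending.glob("*.json")):
        action = json.loads(path.read_text())
        key = f"{action['node']}/{action['service']}"
        if key not in desired:
            reason = "service_removed_from_desired_state"
        elif key in healthy:
            reason = "drift_resolved_auto"
        else:
            continue  # drift still present: leave the action pending
        action["cancel_reason"] = reason
        (cancelled / path.name).write_text(json.dumps(action))
        path.unlink()
        moved[path.name] = reason
    return moved


# Demo in a throwaway directory with three pending actions.
root = Path(tempfile.mkdtemp())
(root / "pending").mkdir()
for node, service in [("piha", "mosquitto"), ("piha", "old-svc"), ("vps", "observer")]:
    doc = {"node": node, "service": service}
    (root / "pending" / f"redeploy-{node}-{service}.json").write_text(json.dumps(doc))

moved = cancel_resolved(
    root,
    healthy={"piha/mosquitto"},
    desired={"piha/mosquitto", "vps/observer"},
)
still_pending = sorted(p.name for p in (root / "pending").glob("*.json"))
```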
**fix(supervisor): monitor:false**
The `monitor: false` field in `services.yaml` excludes a service from supervisor action generation. It is used for `homeassistant` on chelsty-ha (no node-agent).
**fix(agent-system/materializer): control-plane API as source**
The materializer on piha now fetches data from the VPS control-plane API (`CONTROL_PLANE_URL=http://100.95.58.48:18180`) instead of local Redis. Redis held 80 stale entries. The Redis path is kept as a fallback.
**fix(chelsty-infra/zigbee2mqtt): mosquitto networking**
Mosquitto runs with `network_mode: host`, so bridge containers cannot reach it via localhost. Solution: `extra_hosts: - "mosquitto:host-gateway"` in the z2m override. We do not use `network_mode: host` for z2m because it conflicts with `ports:` in docker-compose v1 (1.29.2 on chelsty-infra).
**fix(chelsty-infra/zigbee2mqtt): writable config**
z2m migrates and overwrites `configuration.yaml` at startup. The config must live in the data directory: `/opt/homelab/data/zigbee2mqtt/data/configuration.yaml` (read-write mount), not in a separate `:ro` volume.
---
## FINAL STATE
| Node | Status | Services |
|------|--------|----------|
| vps | online | control-plane (4), node-agent, node_exporter, stability-agent |
| piha | online | agent-system (4), node-agent, stability-agent, monitoring stack |
| solaria | online | node-agent, stability-agent, AI workloads |
| chelsty-infra | online | mosquitto, zigbee2mqtt (z2m connects once SLZB-06U is back online), node-agent, stability-agent |
| chelsty-ha | n/a | homeassistant (monitor:false: no node-agent, HA monitored indirectly via MQTT) |
**Action queue:** 0 pending, 0 approved, 0 running
**Incidents:** 0 active
**Ghost service keys:** 0
---
## KNOWN LIMITATIONS / TODO
- SLZB-06U (Zigbee coordinator) offline: `192.168.1.105:6638` EHOSTUNREACH from chelsty-infra. Probably a hardware/network problem on the 192.168.1.0/24 side. z2m starts and serves an error page on :8080; it will connect automatically when the coordinator returns.
- The `ezsp` adapter in the z2m config is deprecated; migration to `ember` is recommended. It needs no new configuration, only changing the field to `adapter: ember` in `configuration.yaml`.
- chelsty-ha has no node-agent. Add one when a machine is available or via manual bootstrap.
- Redis on piha still contains old keys `homelab:nodes:*`, `homelab:incidents:*` etc.; the materializer no longer uses them in API mode, so they can be cleaned up.


@ -0,0 +1,62 @@
# Stability Agent Multi-Node Rollout
## Architecture Summary
The `stability-agent` is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on **PIHA**.
- **Source**: `services/stability-agent`
- **State Path**: `/opt/homelab/state`
- **Events Path**: `/opt/homelab/events`
- **Redis Target**: `100.108.208.3:6379` (PIHA)
## Why UI only showed CHELSTY
Previously, the `stability-agent` had `NODE_NAME` defaulted to `chelsty` and was only deployed there. The Agent System UI materializer on PIHA filters nodes based on the Redis keys `homelab:nodes:<NODE_NAME>`. Without other agents publishing their specific `NODE_NAME`, the UI remained limited to the single active node.
## Deployment
Use the helper script to deploy or generate commands. The script uses explicit Tailscale IPs for remote targets (piha, chelsty, vps) and runs locally for solaria.
```bash
# Print commands
./scripts/deploy/deploy-stability-agent.sh <node-name>
# Deploy via SSH (executes ssh oskar@<ip>)
./scripts/deploy/deploy-stability-agent.sh <node-name> --ssh
```
### Manual Steps per Node
The manual steps are encapsulated in `services/stability-agent/deploy-local.sh`. On the target node:
```bash
cd /home/oskar/homelab-codex-ws
git fetch origin
git checkout master
git pull origin master
cd services/stability-agent
./deploy-local.sh <node-name>
```
## Verification
### Fleet Overview
Run the verification script from any node with `redis-cli` access:
```bash
./scripts/deploy/verify-agent-fleet.sh
```
### Redis Inspection (on PIHA)
```bash
docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>
```
Verify Web UI backend:
```bash
curl -s http://127.0.0.1:18180/nodes
curl -k https://agents.okit.pl/nodes
```
## Troubleshooting
- **Redis empty after compose down**: The `agent-system-redis` on PIHA uses transient storage if not configured with a volume. If it restarts, agents must republish their state (they do this automatically every `CHECK_INTERVAL`).
- **Secrets**: `.env` files and local secrets are not committed to the repo. Ensure `MQTT_HOST` and other specific secrets are set via overrides if needed.
- **Telegram**: Telegram bot notifications can remain disabled if `TELEGRAM_BOT_TOKEN` is absent.
- **Docker Socket**: If the agent reports `unavailable` for Docker, ensure `/var/run/docker.sock` is mounted and the user has permissions.


@ -49,9 +49,10 @@ Runtime state must live outside the repository to keep it immutable and clean.
 ## Service Standards
 1. **Normalization**: Every service MUST follow the `services/<service>/` layout.
-2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract.
+2. **Metadata**: Every service MUST have a `service.yaml` defining its operational contract. This is the primary source of truth for AI agents.
-3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification.
+3. **Healthchecks**: Every service MUST have a `healthcheck.sh` for verification. Agents use this to emit stability events.
-4. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host.
+4. **Actionability**: Any automated recovery action proposed by an agent must be backed by a `service.yaml` definition.
+5. **Secrets**: NEVER commit secrets to Git. Use `env.example` as a template and populate `/opt/homelab/config/<service>/.env` on the host. Agents must treat these as "black box" configurations.
 ## Docker Compose Standards

docs/vps-control-plane.md

@ -0,0 +1,126 @@
# VPS Control Plane
The VPS Control Plane is the orchestration brain of the homelab platform. It runs on the Hetzner VPS (Tailscale IP: `100.95.58.48`) and provides observability, automated reconciliation, and a web-based operator interface.
## Architecture
The control plane consists of four core services running as a Docker Compose stack under `services/control-plane/`:
| Container | Role |
|-----------|------|
| `control-plane-observer` | Synthesizes world state from events in `/opt/homelab/events/` |
| `control-plane-supervisor` | Detects drift between desired state (`hosts/*/services.yaml`) and actual state (`world/services.json`); writes pending actions |
| `control-plane-executor` | Executes approved actions from `/opt/homelab/actions/approved/` |
| `control-plane-ui` | Web interface for system monitoring and action approval; serves port 18180 |
All services use **filesystem-first** semantics with `/opt/homelab/` as the data exchange layer. All four run with `network_mode: host` and as UID 1000 (`homelab` user).
## Supervisor Behavior
### Desired State
Loaded from `hosts/*/services.yaml` each reconcile cycle. Services with `monitor: false` are silently skipped — use this for services without a node-agent (e.g. `homeassistant` on `chelsty-ha`).
### Drift Types
- `missing_service` — service is in desired state but absent from `services.json`
- `unhealthy_service` — service exists in `services.json` but `status != healthy`
### Action Types
| Trigger | Action type | Risk |
|---------|-------------|------|
| `containers_not_running`, `mqtt_unreachable` | `container_restart` | low |
| Any other / unknown | `redeploy` | guarded |
| Node `disk_pressure: high` | `disk_cleanup` | guarded |
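The table above can be read as a small lookup. The sketch below is an illustration only: `choose_service_action` and `choose_node_action` are hypothetical names, and the supervisor's real dispatch logic is not shown in the source.

```python
from typing import Optional, Tuple

# Triggers that map to a low-risk restart, per the table above.
RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}


def choose_service_action(trigger: str) -> Tuple[str, str]:
    """Return (action_type, risk) for a service-level drift trigger."""
    if trigger in RESTART_TRIGGERS:
        return ("container_restart", "low")
    return ("redeploy", "guarded")  # any other or unknown trigger


def choose_node_action(disk_pressure: str) -> Optional[Tuple[str, str]]:
    """Return a node-level action when disk pressure is high, else None."""
    if disk_pressure == "high":
        return ("disk_cleanup", "guarded")
    return None


restart = choose_service_action("mqtt_unreachable")
fallback = choose_service_action("something_new")
cleanup = choose_node_action("high")
no_cleanup = choose_node_action("low")
```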
### Action ID Stability
Action IDs are deterministic: `redeploy-{node}-{service}` or `container-restart-{node}-{service}`. The same drift always produces the same filename, making reconcile truly idempotent across supervisor restarts.
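A sketch of how such deterministic IDs could be derived. The underscore-to-hyphen rule is inferred from the two examples above and is an assumption; `action_id` is a hypothetical helper name.

```python
def action_id(action_type: str, node: str, service: str) -> str:
    """Build a stable ID such as 'container-restart-piha-mosquitto'."""
    prefix = action_type.replace("_", "-")  # container_restart -> container-restart
    return f"{prefix}-{node}-{service}"


first = action_id("container_restart", "piha", "mosquitto")
second = action_id("container_restart", "piha", "mosquitto")
redeploy = action_id("redeploy", "chelsty-infra", "zigbee2mqtt")
```

Since the ID contains no timestamp or random component, writing the same drift twice targets the same file name rather than creating a duplicate.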
### Auto-Cancel
Pending `redeploy` and `container_restart` actions are automatically moved to `cancelled/` when:
- **`drift_resolved_auto`** — the service becomes `healthy` in actual state
- **`service_removed_from_desired_state`** — the service was removed from `services.yaml` or marked `monitor: false`
Only `pending` actions are auto-cancelled. Approved/running actions have been committed to by the operator and are never cancelled automatically.
### Node Name Resolution
The supervisor supports a `NODE_ALIAS_MAP` environment variable (JSON string) to map event/world-state node names to canonical topology names:
```bash
NODE_ALIAS_MAP='{"node-2": "chelsty-infra", "node-1": "piha"}'
```
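A minimal sketch of consuming this variable, assuming hypothetical helpers `load_alias_map` and `resolve_node`; falling back to an empty map on malformed JSON is an illustrative choice, not confirmed supervisor behaviour.

```python
import json
import os
from typing import Dict, Mapping


def load_alias_map(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Parse NODE_ALIAS_MAP (a JSON object) from the environment; empty if unset or invalid."""
    raw = env.get("NODE_ALIAS_MAP", "")
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def resolve_node(name: str, aliases: Mapping[str, str]) -> str:
    """Map an event-source node name to its canonical topology name."""
    return aliases.get(name, name)


aliases = load_alias_map({"NODE_ALIAS_MAP": '{"node-2": "chelsty-infra", "node-1": "piha"}'})
mapped = resolve_node("node-2", aliases)
passthrough = resolve_node("vps", aliases)
broken = load_alias_map({"NODE_ALIAS_MAP": "{not json"})
```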
## Deployment
### From SATURN (primary control node)
```bash
# Full deploy via SSH
./scripts/deploy/deploy-control-plane.sh --ssh
# Or manually:
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && docker compose up -d --build --force-recreate"
```
### Direct on VPS
```bash
cd ~/homelab-codex-ws/services/control-plane
docker compose up -d --build --force-recreate
```
`deploy-local.sh` also creates the required `/opt/homelab/` directory structure and sets ownership to UID 1000 (requires `sudo`). If directories already exist, skip to the `docker compose` step directly.
### Verification
```bash
# On VPS
docker ps --filter "name=control-plane"
curl -s http://localhost:18180/summary | python3 -m json.tool
```
## Action Approval Workflow
```
Supervisor writes → /opt/homelab/actions/pending/<id>.json
→ Operator UI (port 18180) or Telegram Bot notifies
→ Operator clicks Approve
→ /opt/homelab/actions/approved/<id>.json
→ Executor executes → completed / failed
```
Possible action states: `pending → approved → running → completed / failed / rejected`
Auto-cancel path: `pending → cancelled/`
## Recovery
### World state is stale or corrupt
```bash
# On VPS — delete checkpoint to force full replay
rm /opt/homelab/state/observer_checkpoint.json
docker restart control-plane-observer
```
### Flood of pending actions after bootstrap
Check if node-agent is running and emitting `service_healthy` events on each node. Without `service_healthy`, the supervisor sees all services as missing and queues redeployments every cycle.
```bash
# Check node-agent on each node
ssh oskar@<node> "docker ps --filter name=node-agent && docker logs node-agent --tail 20"
```
### Rebuild from scratch
```bash
ssh oskar@100.95.58.48 "cd ~/homelab-codex-ws/services/control-plane && docker compose up -d --build --force-recreate"
```
## Integration
### piha agent-system webui (port 18180 on piha)
The `agent-system-runtime-materializer` on piha polls the VPS control-plane API every 10 seconds and mirrors world state to piha's local `/opt/homelab/world/`. This ensures the **"Copy for AI"** button in the piha webui (`agent-system-webui`) reflects the same clean state as the VPS API.
Override: `hosts/piha/runtime/agent-system/docker-compose.override.yml` — sets `CONTROL_PLANE_URL=http://100.95.58.48:18180`.
### Nginx Proxy Manager
The operator UI at port 18180 can be proxied via NPM for external access. No WebSocket support required.
### Log Locations
- Container logs: `docker compose logs -f` (from `services/control-plane/`)
- Runtime events: `/opt/homelab/events/YYYY-MM-DD/`
- World state: `/opt/homelab/world/`
- Action queue: `/opt/homelab/actions/{pending,approved,running,completed,failed,cancelled}/`


@ -0,0 +1,24 @@
host: chelsty-ha
site: chelsty

capabilities:
  networking:
    reachability: tailscale-only
    tailscale_ip: 100.122.201.23
    ingress_suitability: false
    bandwidth: LTE

  runtime:
    container_engine: docker
    os: debian

  operational:
    connectivity: intermittent
    availability_target: best-effort
    offline_first: true
    uplink: lte

  deployment:
    suitability:
      - homeassistant
    restricted: false


@ -0,0 +1,20 @@
hostname: chelsty-ha
site: chelsty
roles:
- homeassistant
network:
tailscale_ip: 100.122.201.23
runtime:
root: /opt/homelab
deployment:
mode: pull
managed_by: saturn
constraints:
connectivity:
intermittent: true
uplink: lte


@ -0,0 +1,12 @@
host: chelsty-ha
site: chelsty

services:
  homeassistant:
    role: home-automation-controller
    offline_required: true
    # monitor: false because chelsty-ha has no node-agent deployed, so there are no
    # container-health events for the observer to track. HA is monitored
    # indirectly via the chelsty-infra MQTT broker (if MQTT goes silent, HA
    # is likely down). Re-enable once node-agent is bootstrapped on this VM.
    monitor: false


@ -1,3 +1,6 @@
+host: chelsty-infra
+site: chelsty
 capabilities:
   hardware:
     cpu:
@ -31,10 +34,11 @@ capabilities:
     power_constraint: low-power
     connectivity: intermittent
     availability_target: best-effort
+    offline_operation_required: true
   deployment:
     suitability:
       - staging
-      - homeassistant
+      - infra
       - edge
     restricted: false


@ -1,9 +1,10 @@
-hostname: chelsty
+hostname: chelsty-infra
+site: chelsty
 roles:
   - edge
   - hypervisor
-  - homeassistant
+  - infra
   - staging
 network:


@ -1,4 +1,4 @@
-host: chelsty
+host: chelsty-infra
 uplink:
   type: lte
@ -20,7 +20,7 @@ exposure_classes:
 networks:
   home_automation_lan:
-    purpose: Home Assistant, MQTT, Zigbee coordinator, and local device control.
+    purpose: MQTT broker, Zigbee coordinator, and local device control.
     offline_required: true
     internet_required_for_core_operation: false


@ -1,4 +1,4 @@
-host: chelsty
+host: chelsty-infra
 runtime_root: /opt/homelab
@ -9,12 +9,6 @@ conventions:
   logs: /opt/homelab/logs
 services:
-  homeassistant:
-    data: /opt/homelab/data/homeassistant
-    config: /opt/homelab/config/homeassistant
-    logs: /opt/homelab/logs/homeassistant
-    backup_priority: critical
   zigbee2mqtt:
     data: /opt/homelab/data/zigbee2mqtt
     config: /opt/homelab/config/zigbee2mqtt
@ -27,13 +21,13 @@ services:
     logs: /opt/homelab/logs/mosquitto
     backup_priority: high
-backup_sets:
-  homeassistant:
-    include:
-      - /opt/homelab/config/homeassistant
-      - /opt/homelab/data/homeassistant
-    restore_note: Restore before starting the Home Assistant container.
+  stability-agent:
+    data: /opt/homelab/state
+    config: /opt/homelab/config/stability-agent
+    logs: /opt/homelab/events
+    backup_priority: low
+backup_sets:
   zigbee2mqtt:
     include:
       - /opt/homelab/config/zigbee2mqtt


@ -0,0 +1,88 @@
# Frigate NVR — chelsty-infra
# Hardware decode: Intel UHD 630 via VAAPI (/dev/dri/renderD128)
# Object detection: CPU (no Coral TPU)
# Cameras: 2x Reolink RLC-540 (5MP, WiFi)
#
# Required env vars in /opt/homelab/config/frigate/frigate.env:
# CAMERA1_IP, CAMERA1_USER, CAMERA1_PASS
# CAMERA2_IP, CAMERA2_USER, CAMERA2_PASS
# MQTT_USER, MQTT_PASS (if mosquitto auth is enabled)
mqtt:
  enabled: true
  host: 127.0.0.1
  port: 1883
  # user: "{MQTT_USER}"
  # password: "{MQTT_PASS}"

detectors:
  cpu1:
    type: cpu
    num_threads: 3

ffmpeg:
  hwaccel_args: preset-vaapi
  global_args:
    - -hide_banner
    - -loglevel
    - warning

record:
  enabled: true
  retain:
    days: 7
    mode: all
  events:
    retain:
      default: 14
      mode: motion

snapshots:
  enabled: true
  retain:
    default: 7
  quality: 70

objects:
  track:
    - person
    - car
    - bicycle
  filters:
    person:
      min_area: 5000
      max_area: 100000
      threshold: 0.7

cameras:
  camera1:
    ffmpeg:
      inputs:
        # Main stream: high-res recording
        - path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_main
          roles:
            - record
        # Sub stream: low-res detection (lower CPU cost)
        - path: rtsp://{CAMERA1_USER}:{CAMERA1_PASS}@{CAMERA1_IP}:554/h264Preview_01_sub
          roles:
            - detect
    detect:
      enabled: true
      width: 640
      height: 480
      fps: 5
  camera2:
    ffmpeg:
      inputs:
        - path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_main
          roles:
            - record
        - path: rtsp://{CAMERA2_USER}:{CAMERA2_PASS}@{CAMERA2_IP}:554/h264Preview_01_sub
          roles:
            - detect
    detect:
      enabled: true
      width: 640
      height: 480
      fps: 5


@ -0,0 +1,25 @@
services:
  frigate:
    container_name: frigate
    image: ghcr.io/blakeblackshear/frigate:stable
    restart: unless-stopped
    privileged: true
    shm_size: "256mb"
    network_mode: host
    devices:
      - /dev/dri/renderD128:/dev/dri/renderD128
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /opt/homelab/config/frigate/config.yml:/config/config.yml
      - /opt/homelab/config/frigate:/config/credentials:ro
      - /opt/homelab/data/frigate:/media/frigate
    tmpfs:
      - /tmp/cache
    env_file:
      - /opt/homelab/config/frigate/frigate.env
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:5000/api/version 2>&1 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s


@ -0,0 +1,11 @@
services:
  node-agent:
    environment:
      - NODE_NAME=chelsty-infra
      - NODE_TYPE=lte_node
      - VPS_EVENTS_HOST=100.95.58.48
      - VPS_EVENTS_USER=oskar
      - VPS_EVENTS_PATH=/opt/homelab/events
      - CHECK_INTERVAL=60
    volumes:
      - /home/oskar/.ssh:/root/.ssh:ro


@ -0,0 +1,12 @@
services:
  stability-agent:
    environment:
      - NODE_NAME=chelsty-infra
      - SITE_NAME=chelsty
      - REDIS_HOST=100.108.208.3
      - REDIS_PORT=6379
      - REDIS_ENABLED=true
      - STABILITY_CHECK_INTERVAL=60
      - DISK_THRESHOLD_PCT=85
      - MQTT_HOST=mosquitto
      - MQTT_PORT=1883


@ -0,0 +1,21 @@
services:
  zigbee2mqtt:
    # mosquitto runs with network_mode: host on chelsty-infra.
    # extra_hosts maps the 'mosquitto' hostname to the host gateway IP so that
    # mqtt://mosquitto:1883 in configuration.yaml reaches the host-networked
    # mosquitto process. Requires Docker 20.10+ (present on chelsty-infra).
    extra_hosts:
      - "mosquitto:host-gateway"
    environment:
      - TZ=Europe/Warsaw
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:8080 > /dev/null 2>&1 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 90s
    # Note: volumes NOT overridden here.
    # The base docker-compose.yml mounts /opt/homelab/data/zigbee2mqtt/data:/app/data
    # (read-write). configuration.yaml must be placed in that directory on the node:
    #   /opt/homelab/data/zigbee2mqtt/data/configuration.yaml
    # z2m rewrites this file during migrations, so a read-only mount is not viable.


@ -0,0 +1,37 @@
host: chelsty-infra
site: chelsty

services:
  ha-diag-agent:
    role: ha-diagnostic-agent
    deployment_model: docker-compose
    exposure: local-only
    offline_required: false
    depends_on:
      local: []
      external: [homeassistant]
    config:
      target_url: http://100.70.180.90:8123  # chelsty-ha via Tailscale (HAOS, separate VM)
      location_tag: "chelsty"
      events_dir: /opt/homelab/events/chelsty-infra
    runtime:
      config_path: /opt/homelab/config/ha-diag-agent
      data_path: /var/lib/ha-diag-agent

  node-agent:
    role: node-stability-monitor
    # LTE node: node-agent monitors and emits events but does NO Docker cleanup.
    # Disk pressure on chelsty-infra is typically Frigate recordings; Frigate's
    # own retain policy is the correct remediation, not docker prune.
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true

  mosquitto:
    role: local-mqtt-broker

  zigbee2mqtt:
    role: zigbee-mqtt-bridge

  frigate:
    role: nvr


@ -1,13 +0,0 @@
services:
  zigbee2mqtt:
    volumes:
      - ./configuration.yaml:/app/data/configuration.yaml:ro
    environment:
      - MQTT_USER=${MQTT_USER}
      - MQTT_PASSWORD=${MQTT_PASSWORD}
    # Healthcheck is already defined in base service, but we ensure compatibility
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080"]
      interval: 10s
      timeout: 5s
      retries: 3


@ -1,108 +0,0 @@
host: chelsty

exposure_classes:
  local-only:
    description: Reachable only from CHELSTY-local networks or container networks.
    public_ingress: false
    tailscale_required: false
  tailscale-internal:
    description: Reachable through the Tailscale mesh by approved tailnet clients.
    public_ingress: false
    tailscale_required: true
  public:
    description: Reachable from the public internet through an explicit ingress path.
    public_ingress: true
    tailscale_required: false

operational_constraints:
  uplink: lte
  connectivity: intermittent
  offline_operation_required: true
  must_not_depend_on:
    - saturn
    - vps
    - forgejo

services:
  homeassistant:
    role: home-automation-controller
    deployment_model: docker-compose
    exposure: tailscale-internal
    offline_required: true
    depends_on:
      local:
        - mosquitto
        - zigbee2mqtt
      external: []
    ports:
      - name: http
        container_port: 8123
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/homeassistant
      data_path: /opt/homelab/data/homeassistant
      logs_path: /opt/homelab/logs/homeassistant
    backup:
      recommended: true
      include:
        - /opt/homelab/config/homeassistant
        - /opt/homelab/data/homeassistant
      notes:
        - Back up before Home Assistant core, supervisor-equivalent, or integration upgrades.
        - Keep local restore copies on CHELSTY because LTE connectivity may be unavailable during recovery.

  zigbee2mqtt:
    role: zigbee-mqtt-bridge
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local:
        - mosquitto
      external:
        - slzb-06u
    coordinator:
      name: slzb-06u
      connection: network
      usb_device: null
    ports:
      - name: frontend
        container_port: 8080
        protocol: tcp
        exposure: tailscale-internal
    runtime:
      config_path: /opt/homelab/config/zigbee2mqtt
      data_path: /opt/homelab/data/zigbee2mqtt
      logs_path: /opt/homelab/logs/zigbee2mqtt
    backup:
      recommended: true
      include:
        - /opt/homelab/config/zigbee2mqtt
        - /opt/homelab/data/zigbee2mqtt
      notes:
        - Include configuration.yaml, database.db, coordinator backup files, and network key material.
        - Restore Zigbee2MQTT state together with the SLZB-06U coordinator state when replacing hardware.

  mosquitto:
    role: local-mqtt-broker
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    ports:
      - name: mqtt
        container_port: 1883
        protocol: tcp
    runtime:
      config_path: /opt/homelab/config/mosquitto
      data_path: /opt/homelab/data/mosquitto
      logs_path: /opt/homelab/logs/mosquitto
    backup:
      recommended: true
      include:
        - /opt/homelab/config/mosquitto
        - /opt/homelab/data/mosquitto
      notes:
        - Retain ACL, password, persistence, and bridge configuration if enabled.


@ -0,0 +1,8 @@
services:
  runtime-materializer:
    environment:
      # Pull world state from the VPS control-plane API instead of local Redis.
      # The observer on VPS is the authoritative writer; mirroring its API output
      # here ensures the webui /snapshot matches the clean 97-service state that
      # the control-plane /summary endpoint serves.
      CONTROL_PLANE_URL: "http://100.95.58.48:18180"


@ -0,0 +1,4 @@
services:
  brain-watchdog:
    mem_limit: 64m
    restart: unless-stopped


@ -0,0 +1,11 @@
services:
  node-agent:
    environment:
      - NODE_NAME=piha
      - NODE_TYPE=sd_card
      - VPS_EVENTS_HOST=100.95.58.48
      - VPS_EVENTS_USER=oskar
      - VPS_EVENTS_PATH=/opt/homelab/events
      - CHECK_INTERVAL=60
    volumes:
      - /home/oskar/.ssh:/root/.ssh:ro


@ -0,0 +1,7 @@
services:
  stability-agent:
    environment:
      - NODE_NAME=piha
      - REDIS_HOST=100.108.208.3
      - REDIS_PORT=6379
      - REDIS_ENABLED=true

hosts/piha/services.yaml

@ -0,0 +1,42 @@
host: piha

services:
  ha-diag-agent:
    role: ha-diagnostic-agent
    deployment_model: docker-compose
    exposure: local-only
    offline_required: false
    depends_on:
      local: []
      external: [homeassistant]
    config:
      target_url: http://localhost:8123
      location_tag: "ken"
      events_dir: /opt/homelab/events/piha
    runtime:
      config_path: /opt/homelab/config/ha-diag-agent
      data_path: /var/lib/ha-diag-agent

  node-agent:
    role: node-stability-monitor
    deployment_model: docker-compose
    exposure: local-only
    offline_required: true
    depends_on:
      local: []
      external: []
    runtime:
      config_path: /opt/homelab/config/node-agent
      data_path: /opt/homelab/state
      logs_path: /opt/homelab/events

  brain-watchdog:
    role: control-plane-watchdog
    deployment_model: docker-compose
    exposure: private
    offline_required: false
    depends_on:
      local: []
      external: [control-plane]
    runtime:
      config_path: /opt/homelab/config/brain-watchdog


@ -0,0 +1,11 @@
services:
node-agent:
environment:
- NODE_NAME=solaria
- NODE_TYPE=ai_node
- VPS_EVENTS_HOST=100.95.58.48
- VPS_EVENTS_USER=oskar
- VPS_EVENTS_PATH=/opt/homelab/events
- CHECK_INTERVAL=60
volumes:
- /home/oskar/.ssh:/root/.ssh:ro


@ -0,0 +1,7 @@
services:
stability-agent:
environment:
- NODE_NAME=solaria
- REDIS_HOST=100.108.208.3
- REDIS_PORT=6379
- REDIS_ENABLED=true


@ -0,0 +1,15 @@
host: solaria
services:
node-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []
runtime:
config_path: /opt/homelab/config/node-agent
data_path: /opt/homelab/state
logs_path: /opt/homelab/events


@ -0,0 +1,39 @@
# Control-plane production overrides for the VPS deployment.
#
# NODE_ALIAS_MAP translates the node names that appear in raw event files
# (written by node agents / seed scripts) to the canonical names used in
# inventory/topology.yaml and hosts/*/services.yaml.
#
# Current live mapping (from /opt/homelab/events/ inspection):
# node-2 → chelsty (zigbee2mqtt / mosquitto / homeassistant node)
#
# Add further entries when new nodes come online and their event-source names
# differ from their topology names. Format is a single-line JSON object, e.g.:
# NODE_ALIAS_MAP='{"node-2":"chelsty","node-3":"piha"}'
#
# The executor inherits the canonical name from the action JSON written by the
# supervisor, so NODE_ALIAS_MAP is only required on the supervisor service.
#
# Memory limits: VPS has 4 GiB RAM, no swap. oom_score_adj -900 makes the
# host kernel OOM-killer very unlikely to target control-plane containers.
# mem_limit sets a per-container cgroup ceiling so a leaking process is killed
# (and restarted by Docker, given a restart policy) before it exhausts host memory.
services:
operator-ui:
mem_limit: 192m
oom_score_adj: -900
observer:
mem_limit: 192m
oom_score_adj: -900
supervisor:
mem_limit: 400m
oom_score_adj: -900
environment:
- NODE_ALIAS_MAP={"node-2":"chelsty"}
executor:
mem_limit: 64m
oom_score_adj: -900
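
The alias translation described in the comments above can be sketched as follows. This is only an illustration, not the supervisor's actual code: `load_alias_map` and `canonical_node` are hypothetical names, and the one detail taken from the file is that NODE_ALIAS_MAP holds a single-line JSON object.

```python
import json
import os


def load_alias_map(env=None):
    """Parse NODE_ALIAS_MAP (a single-line JSON object) into a dict.

    Hypothetical helper: returns an empty mapping when the variable is
    unset or empty, and rejects values that are not a JSON object.
    """
    env = os.environ if env is None else env
    raw = env.get("NODE_ALIAS_MAP", "").strip()
    if not raw:
        return {}
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("NODE_ALIAS_MAP must be a JSON object")
    return {str(key): str(value) for key, value in parsed.items()}


def canonical_node(name, alias_map):
    """Map an event-source name to its topology name; unmapped names pass through."""
    return alias_map.get(name, name)


aliases = load_alias_map({"NODE_ALIAS_MAP": '{"node-2":"chelsty"}'})
print(canonical_node("node-2", aliases))
print(canonical_node("piha", aliases))
```

Letting unmapped names pass through unchanged means a node whose event-source name already matches its topology name needs no entry.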


@ -0,0 +1,7 @@
# Control Plane Environment Variables
PORT=8080
HOMELAB_STATE_ROOT=/opt/homelab/state
HOMELAB_EVENTS_ROOT=/opt/homelab/events
HOMELAB_WORLD_ROOT=/opt/homelab/world
HOMELAB_ACTIONS_ROOT=/opt/homelab/actions
HOMELAB_CONFIG_ROOT=/opt/homelab/config


@ -0,0 +1,16 @@
services:
node-agent:
environment:
- NODE_NAME=vps
- CHECK_INTERVAL=60
# host network mode: node-agent on VPS shares the host's network namespace
# so that localhost:18180 resolves to the control-plane's exposed port.
# Without this, localhost inside the container is the container's own loopback
# and the _check_control_plane_health() probe would always fail.
network_mode: host
# HARD memory ceiling: node-agent mounts /opt/homelab/events/ (page cache)
# and may accumulate Python RSS over hours; 640m cap ensures it is killed and
# auto-restarted by Docker before consuming host memory. oom_score_adj -900
# makes the host kernel OOM-killer unlikely to pick it as a global victim.
mem_limit: 640m
oom_score_adj: -900


@ -0,0 +1,9 @@
services:
stability-agent:
environment:
- NODE_NAME=vps
- REDIS_HOST=100.108.208.3
- REDIS_PORT=6379
- REDIS_ENABLED=true
mem_limit: 96m
oom_score_adj: -900
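
As a rough check of the memory budget implied by the VPS overrides above, the per-container limits can be summed against the stated 4 GiB of host RAM. This is a sketch under assumptions: `mem_limit_to_mib` is a hypothetical helper, and the total covers only the six containers whose limits appear in these files, not the other services on the VPS.

```python
def mem_limit_to_mib(value):
    """Convert a Compose-style size string ('192m', '1g', '65536k') to MiB."""
    factors = {"b": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}
    text = value.strip().lower()
    if text[-1] in factors:
        return float(text[:-1]) * factors[text[-1]]
    return float(text) / (1024 * 1024)  # a bare number is a byte count


# Limits copied from the VPS override files shown above.
VPS_MEM_LIMITS = {
    "operator-ui": "192m",
    "observer": "192m",
    "supervisor": "400m",
    "executor": "64m",
    "node-agent": "640m",
    "stability-agent": "96m",
}

HOST_RAM_MIB = 4 * 1024  # "VPS has 4 GiB RAM, no swap"

committed = sum(mem_limit_to_mib(v) for v in VPS_MEM_LIMITS.values())
headroom = HOST_RAM_MIB - committed
print(f"committed={committed:.0f} MiB, headroom={headroom:.0f} MiB")
```

The headroom figure is an upper bound on what is left for containers without a `mem_limit` and for the host itself.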


@ -1 +0,0 @@
npm

hosts/vps/services.yaml Normal file

@ -0,0 +1,43 @@
host: vps
services:
node-agent:
role: node-stability-monitor
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []
runtime:
config_path: /opt/homelab/config/node-agent
data_path: /opt/homelab/state
logs_path: /opt/homelab/events
control-plane:
role: management-and-orchestration
deployment_model: docker-compose
exposure: tailscale-internal
offline_required: false
depends_on:
local:
- node-agent
external:
- piha:redis
ports:
- name: http
container_port: 18180
protocol: tcp
runtime:
config_path: /opt/homelab/config/control-plane
data_path: /opt/homelab/data/control-plane
logs_path: /opt/homelab/logs/control-plane
node_exporter:
role: metrics-exporter
deployment_model: docker-compose
exposure: local-only
offline_required: true
depends_on:
local: []
external: []


@ -17,6 +17,10 @@ nodes:
 roles:
 - infra
 - monitoring
+services:
+- node-agent
+- ha-diag-agent
+- brain-watchdog
 solaria:
 roles:
@ -27,12 +31,25 @@ nodes:
 roles:
 - edge
 - ingress
+- control-plane
+services:
+# Repo-managed GitOps services (hosts/vps/services.yaml is authoritative)
+- node-agent
+- control-plane # executor, observer, supervisor, operator-ui
+- node_exporter
+- stability-agent
+- npm # Nginx Proxy Manager — public ingress, TLS termination
+- outline # Team wiki (outline + postgres + redis)
+- joplin # Note sync server (joplin-server + postgres)
+- ai-cluster # AI workers: codex-worker, openclaw, planner-worker,
+# service-ops-worker, redis, mosquitto
-chelsty:
+chelsty-infra:
+site: chelsty
 roles:
 - remote
 - hypervisor
-- homeassistant
+- infra
 - staging
 connectivity:
 uplink: lte
@ -40,10 +57,22 @@ nodes:
 home_automation:
 offline_operation_required: true
 services:
-- homeassistant
 - zigbee2mqtt
 - mosquitto
 coordinator:
 model: SLZB-06U
 connection: network
 usb: false
+chelsty-ha:
+site: chelsty
+roles:
+- remote
+- homeassistant
+connectivity:
+uplink: lte
+intermittent: true
+home_automation:
+offline_operation_required: true
+services:
+- homeassistant


@ -0,0 +1,75 @@
#!/usr/bin/env bash
# vps-control-plane.sh - Bootstrap script for VPS control plane
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
RUNTIME_DIR="/opt/homelab"
VPS_CONFIG="$REPO_ROOT/hosts/vps/runtime"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
log() { echo -e "${GREEN}[INFO]${NC} $1"; }
warn() { echo -e "${YELLOW}[WARN]${NC} $1"; }
error() { echo -e "${RED}[ERROR]${NC} $1"; exit 1; }
log "Starting VPS control plane bootstrap..."
# 1. Validate Docker availability
if ! command -v docker &> /dev/null; then
error "Docker is not installed. Please install Docker first."
fi
# 2. Validate compose plugin
if ! docker compose version &> /dev/null; then
error "Docker Compose plugin is not installed."
fi
log "Docker and Compose plugin verified."
# 3. Create filesystem-first runtime structure
log "Creating filesystem-first runtime structure in $RUNTIME_DIR..."
sudo mkdir -p "$RUNTIME_DIR/events" \
"$RUNTIME_DIR/state" \
"$RUNTIME_DIR/world" \
"$RUNTIME_DIR/actions/pending" \
"$RUNTIME_DIR/actions/approved" \
"$RUNTIME_DIR/actions/running" \
"$RUNTIME_DIR/actions/completed" \
"$RUNTIME_DIR/actions/failed" \
"$RUNTIME_DIR/actions/rejected" \
"$RUNTIME_DIR/config" \
"$RUNTIME_DIR/logs"
# 4. Set permissions
log "Setting permissions..."
sudo chown -R $USER:$USER "$RUNTIME_DIR"
chmod -R 755 "$RUNTIME_DIR"
# 5. Install environment file
log "Installing environment configuration..."
if [ ! -f "$RUNTIME_DIR/config/control-plane.env" ]; then
cp "$VPS_CONFIG/control-plane/env.example" "$RUNTIME_DIR/config/control-plane.env"
log "Created $RUNTIME_DIR/config/control-plane.env from template."
else
warn "Environment file already exists, skipping installation."
fi
# 6. Build and start the control plane
log "Building and starting control plane services..."
cd "$REPO_ROOT/services/control-plane"
docker compose build
docker compose up -d
log "VPS control plane bootstrap complete!"
echo -e "\n${YELLOW}Verification commands:${NC}"
echo "1. Check container status: docker compose ps"
echo "2. Check operator UI: curl http://localhost:8080/summary"
echo "3. Validate world state: ls -l $RUNTIME_DIR/world"
echo "4. Monitor events: tail -f $RUNTIME_DIR/events/*/*/*.json"
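
The directory layout created in step 3 of the bootstrap script can be expressed compactly. The sketch below mirrors the same eleven directories; `create_runtime_tree` is a hypothetical name, and the demo writes into a temporary directory rather than /opt/homelab.

```python
import tempfile
from pathlib import Path

# Top-level directories and action-queue states taken from the mkdir call above.
TOP_LEVEL = ("events", "state", "world", "config", "logs")
ACTION_STATES = ("pending", "approved", "running", "completed", "failed", "rejected")


def create_runtime_tree(root):
    """Create the filesystem-first runtime layout under root; return the paths."""
    root = Path(root)
    paths = [root / name for name in TOP_LEVEL]
    paths += [root / "actions" / state for state in ACTION_STATES]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)  # idempotent, like mkdir -p
    return paths


with tempfile.TemporaryDirectory() as tmp:
    created = create_runtime_tree(tmp)
    print(len(created), "directories created")
```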


@ -0,0 +1,23 @@
#!/bin/bash
# scripts/deploy/deploy-control-plane.sh
set -e
VPS_IP="100.95.58.48"
USER="oskar"
REMOTE_REPO_PATH="/home/oskar/homelab-codex-ws"
MODE=$1
case "$MODE" in
"--ssh")
echo "Deploying to VPS ($VPS_IP) via SSH..."
ssh -t "$USER@$VPS_IP" "cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh"
;;
"--print")
echo "ssh -t $USER@$VPS_IP \"cd $REMOTE_REPO_PATH && git pull origin master && cd services/control-plane && bash deploy-local.sh\""
;;
*)
echo "Usage: $0 [--ssh|--print]"
exit 1
;;
esac


@ -0,0 +1,26 @@
#!/usr/bin/env bash
# deploy-frigate.sh - Deploy Frigate NVR on chelsty-infra (print or SSH)
MODE="print"
[[ "$1" == "--ssh" ]] && MODE="ssh"
TARGET="100.122.201.22"
NODE="chelsty-infra"
REPO_PATH="/home/oskar/homelab-codex-ws"
SERVICE_PATH="$REPO_PATH/hosts/chelsty-infra/runtime/frigate"
echo "HOST: $NODE"
echo "MODE: $MODE"
echo "TARGET: $TARGET"
# Secrets must exist at /opt/homelab/config/frigate/frigate.env on the node
# before first deploy. See config.yml for required variables.
DEPLOY_CMD="cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd $SERVICE_PATH && docker-compose pull && docker-compose up -d"
if [[ "$MODE" == "ssh" ]]; then
echo "--- Deploying Frigate to $NODE ($TARGET) via SSH ---"
ssh oskar@$TARGET "$DEPLOY_CMD"
else
echo "# --- Deployment commands for $NODE ---"
echo "ssh oskar@$TARGET '$DEPLOY_CMD'"
fi


@ -8,6 +8,7 @@ set -e
 REPO_PATH="${HOME}/homelab-codex-ws"
 RUNTIME_PATH="/opt/homelab"
 HOSTNAME=$(hostname | tr '[:lower:]' '[:upper:]')
+HOST_DIR="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')"
 echo "--- Starting Deployment on ${HOSTNAME} ---"
@ -22,20 +23,33 @@ echo "Pulling latest changes..."
 git pull
 # 2. Identify Services
-# Based on our convention, we look for services assigned to this host
-# For now, we'll check if a 'services.txt' exists in the host folder
-SERVICE_LIST="${REPO_PATH}/hosts/$(hostname | tr '[:upper:]' '[:lower:]')/services.txt"
+SERVICES=()
+if [ -f "${HOST_DIR}/services.txt" ]; then
+    mapfile -t SERVICES < <(grep -v '^\s*#' "${HOST_DIR}/services.txt" | grep -v '^\s*$')
+elif [ -f "${HOST_DIR}/services.yaml" ]; then
+    SERVICES=($(python3 -c "
+import yaml, sys
+try:
+    with open('${HOST_DIR}/services.yaml', 'r') as f:
+        data = yaml.safe_load(f)
+    if data and 'services' in data:
+        if isinstance(data['services'], dict):
+            print(' '.join(data['services'].keys()))
+        elif isinstance(data['services'], list):
+            print(' '.join(data['services']))
+except Exception as e:
+    print(f'Error parsing YAML: {e}', file=sys.stderr)
+    sys.exit(1)
+"))
+fi
-if [ ! -f "$SERVICE_LIST" ]; then
-    echo "No services.txt found for ${HOSTNAME}. Skipping service deployment."
+if [ ${#SERVICES[@]} -eq 0 ]; then
+    echo "No services found for ${HOSTNAME}. Skipping service deployment."
     exit 0
 fi
 # 3. Deploy Services
-while IFS= read -r service || [ -n "$service" ]; do
-    [[ "$service" =~ ^#.*$ ]] && continue # Skip comments
-    [[ -z "$service" ]] && continue # Skip empty lines
+for service in "${SERVICES[@]}"; do
     echo "Deploying service: ${service}..."
     COMPOSE_FILE="${REPO_PATH}/services/${service}/docker-compose.yml"
@ -45,13 +59,10 @@ while IFS= read -r service || [ -n "$service" ]; do
         continue
     fi
-    # Target directory in runtime
     TARGET_DIR="${RUNTIME_PATH}/services/${service}"
     mkdir -p "$TARGET_DIR"
-    # We use the compose file from the repo directly
-    # but we can also handle overrides here
-    OVERRIDE_FILE="${RUNTIME_PATH}/config/${service}/docker-compose.override.yml"
+    OVERRIDE_FILE="${HOST_DIR}/runtime/${service}/docker-compose.override.yml"
     COMPOSE_CMD="docker compose -f ${COMPOSE_FILE}"
     if [ -f "$OVERRIDE_FILE" ]; then
@ -60,7 +71,6 @@ while IFS= read -r service || [ -n "$service" ]; do
     fi
     $COMPOSE_CMD up -d --remove-orphans
+done
-done < "$SERVICE_LIST"
 echo "--- Deployment Complete ---"
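
The service discovery in step 2 accepts either a plain services.txt or a services.yaml whose `services` key is a mapping or a list. A minimal sketch of that selection logic, working on already-parsed data so it needs no YAML library; the function names are hypothetical:

```python
def service_names(data):
    """Return service names from a parsed services.yaml document.

    Accepts the mapping form (names are the keys) and the list form;
    anything else yields an empty list.
    """
    if not data:
        return []
    services = data.get("services")
    if isinstance(services, dict):
        return list(services.keys())
    if isinstance(services, list):
        return [str(item) for item in services]
    return []


def services_from_txt(text):
    """Return the non-blank, non-comment lines of a services.txt file."""
    return [
        line
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


print(service_names({"host": "vps", "services": {"node-agent": {}, "control-plane": {}}}))
print(services_from_txt("# managed by hand\nnpm\n\noutline\n"))
```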


@ -0,0 +1,55 @@
#!/usr/bin/env bash
# deploy-stability-agent.sh - Helper to deploy stability-agent (print or SSH)
NODE=$1
MODE="print"
[[ "$2" == "--ssh" ]] && MODE="ssh"
if [[ -z "$NODE" ]]; then
echo "Usage: $0 <node-name> [--ssh]"
echo "Supported nodes: chelsty, piha, solaria, vps"
exit 1
fi
case "$NODE" in
piha) TARGET="100.108.208.3" ;;
chelsty) TARGET="100.122.201.22" ;;
vps) TARGET="100.95.58.48" ;;
solaria) TARGET="local" ;;
*)
echo "Error: Unknown node '$NODE'"
echo "Supported nodes: chelsty, piha, solaria, vps"
exit 1
;;
esac
echo "HOST: $NODE"
echo "MODE: $MODE"
echo "TARGET: $TARGET"
REPO_PATH="/home/oskar/homelab-codex-ws"
if [[ "$NODE" == "solaria" ]]; then
if [[ "$MODE" == "ssh" ]]; then
echo "--- Running local deployment for solaria ---"
cd "$REPO_PATH" && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh solaria
else
echo "# --- Deployment commands for solaria ---"
echo "cd $REPO_PATH"
echo "git fetch origin"
echo "git checkout master"
echo "git pull origin master"
echo "cd services/stability-agent"
echo "./deploy-local.sh solaria"
fi
else
# Remote nodes
SSH_CMD="ssh oskar@$TARGET 'cd $REPO_PATH && git fetch origin && git checkout master && git pull origin master && cd services/stability-agent && ./deploy-local.sh $NODE'"
if [[ "$MODE" == "ssh" ]]; then
echo "--- Deploying to $NODE ($TARGET) via SSH ---"
eval "$SSH_CMD"
else
echo "# --- Deployment commands for $NODE ---"
echo "$SSH_CMD"
fi
fi


@ -1,270 +1,321 @@
 #!/usr/bin/env bash
-# deploy.sh - Staged deployment framework for homelab nodes.
+# scripts/deploy/deploy.sh — Saturn-side deploy dispatcher
+# Usage: deploy.sh <target> [--dry-run] [--no-gate]
+# target ∈ {control-plane, vps, piha, solaria, chelsty-infra}
+# Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
-set -o pipefail
+set -uo pipefail
-# --- Configuration ---
-export RUNTIME_PATH="/opt/homelab"
-export STATE_DIR="${RUNTIME_PATH}/state/deploy"
-export LOG_DIR="${RUNTIME_PATH}/logs/deploy"
-export REPO_PATH="${HOME}/homelab-codex-ws"
-export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
-export LOG_FILE="${LOG_DIR}/deploy_${TIMESTAMP}.log"
-# --- Initialization ---
-mkdir -p "$STATE_DIR" "$LOG_DIR"
-# Redirection for logging
-exec > >(tee -a "$LOG_FILE") 2>&1
-# --- Load Libraries ---
-LIB_PATH="${REPO_PATH}/scripts/lib"
-source "${LIB_PATH}/log.sh"
-source "${LIB_PATH}/state.sh"
-source "${LIB_PATH}/inventory.sh"
-source "${LIB_PATH}/compose.sh"
-source "${LIB_PATH}/diagnostics.sh"
-# --- CLI Parsing ---
-TARGET_HOST=$(hostname)
-TARGET_SERVICE=""
-RESUME=false
-REQUESTED_STAGE=""
+REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+SSH_USER="${SSH_USER:-oskar}"
+START_TIME=$(date +%s)
+TARGET=""
+DRY_RUN=false
+NO_GATE=false
+usage() {
+    cat >&2 <<'EOF'
+Usage: deploy.sh <target> [--dry-run] [--no-gate]
+Targets:
+  control-plane  observer/supervisor/executor/operator-ui on VPS
+  vps            all VPS GitOps services
+  piha           PIHA services
+  solaria        SOLARIA compute services
+  chelsty-infra  CHELSTY edge node (LTE, longer SSH timeout)
+Flags:
+  --dry-run      run preflight + gate only; stop before deploy
+  --no-gate      skip pytest + docker build (emergency only; logged as WARNING)
+Exit codes: 0=ok 1=preflight 2=gate 3=execute 4=verify 5=handoff(sudo)
+EOF
+    exit 1
+}
 while [[ $# -gt 0 ]]; do
     case $1 in
-        --host)
-            TARGET_HOST="$2"
-            shift 2
-            ;;
-        --service)
-            TARGET_SERVICE="$2"
-            shift 2
-            ;;
-        --resume)
-            RESUME=true
-            shift
-            ;;
-        --stage)
-            REQUESTED_STAGE="$2"
-            shift 2
-            ;;
+        control-plane|vps|piha|solaria|chelsty-infra)
+            TARGET="$1"; shift ;;
+        --dry-run)
+            DRY_RUN=true; shift ;;
+        --no-gate)
+            NO_GATE=true; shift ;;
+        -h|--help)
+            usage ;;
         *)
-            if [[ "$1" =~ ^(prepare|validate|deploy|verify|diagnose|complete)$ ]]; then
-                REQUESTED_STAGE="$1"
-            fi
-            shift
-            ;;
+            echo "Unknown argument: $1" >&2
+            usage ;;
     esac
 done
-# --- Stages ---
-stage_prepare() {
-    local host=$1
-    if is_stage_complete "prepare" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping PREPARE (already complete)"
-        return 0
-    fi
-    log "INFO" "Stage: PREPARE ($host)"
-    set_stage "prepare"
-    emit_event "deployment_started" "info" "deploy.sh" "all" "${TIMESTAMP}" "{\"stage\": \"prepare\"}"
-    cd "$REPO_PATH" || exit 1
-    log "INFO" "Pulling latest changes..."
-    if ! git pull; then
-        log "WARN" "Git pull failed, proceeding with local state (offline mode or network flap)"
-    fi
-    # Ensure runtime directories exist
-    mkdir -p "${RUNTIME_PATH}/config" "${RUNTIME_PATH}/data" "${RUNTIME_PATH}/state" "${RUNTIME_PATH}/logs"
-    struct_log "prepare" "$host" "all" "success" "repo_updated"
-    mark_stage_complete "prepare"
-}
+[[ -z "$TARGET" ]] && { echo "Error: target is required." >&2; usage; }
+case "$TARGET" in
+    control-plane) SSH_HOST="vps" ;;
+    *) SSH_HOST="$TARGET" ;;
+esac
+case "$TARGET" in
+    chelsty-*) SSH_TIMEOUT=30 ;;
+    *) SSH_TIMEOUT=5 ;;
+esac
+# ── PREFLIGHT ────────────────────────────────────────────────────────────────
+preflight() {
+    echo "=== PREFLIGHT ==="
+    local branch
+    branch=$(git -C "$REPO_ROOT" rev-parse --abbrev-ref HEAD)
+    if [[ "$branch" != "master" ]]; then
+        echo "ERROR: On branch '${branch}', not master. Switch to master and push first." >&2
-stage_validate() {
-    local host=$1
-    if is_stage_complete "validate" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping VALIDATE (already complete)"
-        return 0
-    fi
-    log "INFO" "Stage: VALIDATE ($host)"
-    set_stage "validate"
-    for service in "${SERVICES[@]}"; do
-        log "INFO" "Validating $service..."
-        if [[ ! -d "${REPO_PATH}/services/$service" ]]; then
-            log "ERROR" "Service definition not found: $service"
-            struct_log "validate" "$host" "$service" "fail" "not_found"
-            return 1
-        fi
-    done
-    struct_log "validate" "$host" "all" "success" "validated"
-    mark_stage_complete "validate"
-}
-stage_deploy() {
-    local host=$1
-    if is_stage_complete "deploy" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping DEPLOY (already complete)"
-        return 0
-    fi
-    log "INFO" "Stage: DEPLOY ($host)"
-    set_stage "deploy"
-    local last_s=$(get_last_service)
-    local skip=false
-    if [[ "$RESUME" == "true" && -n "$last_s" ]]; then
-        skip=true
-    fi
-    for service in "${SERVICES[@]}"; do
-        if [[ "$skip" == "true" ]]; then
-            if [[ "$service" == "$last_s" ]]; then
-                skip=false
-                log "INFO" "Resuming from $service..."
-            else
-                log "INFO" "Skipping $service (already processed)"
-                continue
-            fi
-        fi
-        log "INFO" "Deploying $service..."
-        set_last_service "$service"
-        if ! run_compose_up "$service"; then
-            struct_log "deploy" "$host" "$service" "fail" "docker_compose_failed"
-            collect_diagnostics "$host" "$service"
-            return 1
-        fi
-        struct_log "deploy" "$host" "$service" "success" "deployed"
-    done
-    set_last_service ""
-    mark_stage_complete "deploy"
-}
-stage_verify() {
-    local host=$1
-    if is_stage_complete "verify" && [[ "$RESUME" == "true" ]]; then
-        log "INFO" "Skipping VERIFY (already complete)"
-        return 0
-    fi
-    log "INFO" "Stage: VERIFY ($host)"
-    set_stage "verify"
-    for service in "${SERVICES[@]}"; do
-        log "INFO" "Verifying $service..."
-        local health_script="${REPO_PATH}/services/${service}/healthcheck.sh"
-        if [[ -f "$health_script" ]]; then
-            if ! bash "$health_script"; then
-                log "ERROR" "Healthcheck failed for $service"
-                struct_log "verify" "$host" "$service" "fail" "healthcheck_failed"
-                collect_diagnostics "$host" "$service"
-                return 1
-            fi
-        else
-            # Generic check if container is running
-            if ! docker ps --filter "name=$service" --filter "status=running" | grep -q "$service"; then
-                log "ERROR" "Container $service is not running"
-                struct_log "verify" "$host" "$service" "fail" "container_not_running"
-                collect_diagnostics "$host" "$service"
-                return 1
-            fi
-        fi
-        struct_log "verify" "$host" "$service" "success" "verified"
-    done
-    mark_stage_complete "verify"
-}
-stage_complete() {
-    local host=$1
-    log "INFO" "Stage: COMPLETE ($host)"
-    set_stage "complete"
-    struct_log "complete" "$host" "all" "success" "deployment_finished"
-    clear_deployment_state
-}
-# --- Execution Logic ---
-run_deployment() {
-    local start_stage=$1
-    # Sequential execution from start_stage
-    case "$start_stage" in
-        prepare)
-            stage_prepare "$TARGET_HOST" || return 1
-            ;&
-        validate)
-            stage_validate "$TARGET_HOST" || return 1
-            ;&
-        deploy)
-            stage_deploy "$TARGET_HOST" || return 1
-            ;&
-        verify)
-            stage_verify "$TARGET_HOST" || return 1
-            ;&
-        complete)
-            stage_complete "$TARGET_HOST" || return 1
-            ;;
-        *)
-            log "ERROR" "Invalid stage: $start_stage"
-            return 1
-            ;;
-    esac
-}
-# --- Main ---
-log "INFO" "--- Homelab Deployment Started (Host: $TARGET_HOST, Service: ${TARGET_SERVICE:-all}) ---"
-if ! load_inventory "$TARGET_HOST" "$TARGET_SERVICE"; then
-    log "ERROR" "Failed to load inventory"
-    exit 1
+        exit 1
-fi
-EXIT_STATUS=0
-if [[ "$RESUME" == "true" ]]; then
-    CURRENT=$(get_stage)
-    log "INFO" "Resuming from state: $CURRENT"
-    case "$CURRENT" in
-        prepare|validate|deploy|verify)
-            run_deployment "$CURRENT" || EXIT_STATUS=1
-            ;;
-        complete|none)
-            log "INFO" "No interrupted deployment found. Starting from scratch..."
-            run_deployment "prepare" || EXIT_STATUS=1
-            ;;
-        *)
-            log "INFO" "Unknown state. Starting from prepare..."
-            run_deployment "prepare" || EXIT_STATUS=1
-            ;;
-    esac
-elif [[ -n "$REQUESTED_STAGE" ]]; then
-    if [[ "$REQUESTED_STAGE" == "diagnose" ]]; then
-        collect_diagnostics "$TARGET_HOST" "$TARGET_SERVICE"
-    else
-        run_deployment "$REQUESTED_STAGE" || EXIT_STATUS=1
     fi
-else
-    # New deployment - clear previous state
-    clear_deployment_state
-    run_deployment "prepare" || EXIT_STATUS=1
+    echo "[ok] branch: master"
+    if ! git -C "$REPO_ROOT" diff --quiet; then
+        echo "ERROR: Unstaged changes in working tree. Commit or stash before deploying." >&2
+        exit 1
+    fi
+    if ! git -C "$REPO_ROOT" diff --cached --quiet; then
+        echo "ERROR: Staged but uncommitted changes. Commit before deploying." >&2
+        exit 1
+    fi
+    echo "[ok] working tree clean"
+    git -C "$REPO_ROOT" fetch origin master --quiet
+    local unpushed
+    unpushed=$(git -C "$REPO_ROOT" log origin/master..HEAD --oneline)
+    if [[ -n "$unpushed" ]]; then
+        echo "ERROR: Unpushed commits on master:" >&2
+        echo "$unpushed" >&2
+        echo "Push first: git push origin master" >&2
+        exit 1
+    fi
+    echo "[ok] no unpushed commits"
+    echo "Checking SSH: ${SSH_USER}@${SSH_HOST} (ConnectTimeout=${SSH_TIMEOUT}s)..."
+    if ! ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+        "${SSH_USER}@${SSH_HOST}" true 2>/dev/null; then
+        echo "ERROR: Cannot reach ${SSH_HOST} via SSH (timeout ${SSH_TIMEOUT}s)." >&2
+        exit 1
+    fi
+    echo "[ok] ${SSH_HOST} reachable"
+}
+# ── GATE ─────────────────────────────────────────────────────────────────────
+gate() {
+    if [[ "$NO_GATE" == "true" ]]; then
+        echo "=== GATE: SKIPPED ==="
+        echo "WARNING: --no-gate active — pytest + docker build bypassed (emergency mode)." >&2
+        return 0
+    fi
+    echo "=== GATE ==="
+    local services=()
+    if [[ "$TARGET" == "control-plane" ]]; then
+        services=("control-plane")
+    else
+        local svc_yaml="${REPO_ROOT}/hosts/${TARGET}/services.yaml"
+        if [[ ! -f "$svc_yaml" ]]; then
+            echo "ERROR: ${svc_yaml} not found." >&2
+            exit 2
+        fi
+        local svc_list
+        svc_list=$(python3 -c "
+import yaml
+with open('${svc_yaml}') as f:
+    data = yaml.safe_load(f)
+svcs = data.get('services', {})
+if isinstance(svcs, dict):
+    print('\n'.join(svcs.keys()))
+elif isinstance(svcs, list):
+    print('\n'.join(svcs))
+")
+        while IFS= read -r svc; do
+            [[ -z "$svc" ]] && continue
+            if [[ -f "${REPO_ROOT}/services/${svc}/Dockerfile" ]]; then
+                services+=("$svc")
+            fi
+        done <<< "$svc_list"
+    fi
+    if [[ ${#services[@]} -eq 0 ]]; then
+        echo "[info] No services with local Dockerfile found for ${TARGET} — gate trivially passes."
+        return 0
+    fi
+    echo "Services under gate: ${services[*]}"
+    local gate_failed=false
+    for svc in "${services[@]}"; do
+        local svc_dir="${REPO_ROOT}/services/${svc}"
+        if [[ -d "${svc_dir}/tests" ]]; then
+            echo "--- pytest: ${svc} ---"
+            if ! python3 -m pytest "${svc_dir}/tests" -q; then
+                echo "GATE FAIL: pytest failed for ${svc}" >&2
+                gate_failed=true
+            fi
+        fi
+        echo "--- docker build: ${svc} ---"
+        if ! docker build --quiet "${svc_dir}" >/dev/null; then
+            echo "GATE FAIL: docker build failed for ${svc}" >&2
+            gate_failed=true
+        fi
+    done
+    if [[ "$gate_failed" == "true" ]]; then
+        exit 2
+    fi
+    echo "[ok] gate passed"
+}
+# ── EXECUTE ──────────────────────────────────────────────────────────────────
+execute() {
+    echo "=== EXECUTE ==="
+    local cmd_output
+    local cmd_exit=0
+    if [[ "$TARGET" == "control-plane" ]]; then
+        echo "Running deploy-control-plane.sh --ssh..."
+        cmd_output=$("${REPO_ROOT}/scripts/deploy/deploy-control-plane.sh" --ssh 2>&1) \
+            || cmd_exit=$?
+    else
+        echo "SSHing to ${SSH_HOST}: git pull + deploy-node.sh..."
+        cmd_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+            "${SSH_USER}@${SSH_HOST}" \
+            'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh' 2>&1) \
+            || cmd_exit=$?
+    fi
+    echo "$cmd_output"
+    if echo "$cmd_output" | grep -qF "[sudo] password"; then
+        echo "" >&2
+        echo "ERROR (exit 5): Deploy hit an interactive sudo prompt." >&2
+        echo "Run manually:" >&2
+        if [[ "$TARGET" == "control-plane" ]]; then
+            echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull origin master && cd services/control-plane && bash deploy-local.sh'" >&2
+        else
+            echo " ssh -t ${SSH_USER}@${SSH_HOST} 'cd ~/homelab-codex-ws && git pull && ./scripts/deploy/deploy-node.sh'" >&2
+        fi
+        exit 5
+    fi
+    if [[ $cmd_exit -ne 0 ]]; then
+        echo "ERROR: Deploy command exited ${cmd_exit}." >&2
+        exit 3
+    fi
+    echo "[ok] execute completed"
+}
+# ── VERIFY ───────────────────────────────────────────────────────────────────
+verify() {
+    echo "=== VERIFY ==="
+    local ps_output
+    local ps_exit=0
+    ps_output=$(ssh -o "ConnectTimeout=${SSH_TIMEOUT}" -o BatchMode=yes \
+        "${SSH_USER}@${SSH_HOST}" \
+        'docker ps --format "{{.Names}}\t{{.Status}}"' 2>&1) \
+        || ps_exit=$?
+    if [[ $ps_exit -ne 0 ]]; then
+        echo "ERROR: docker ps failed on ${SSH_HOST}:" >&2
+        echo "$ps_output" >&2
+        exit 4
+    fi
+    echo "$ps_output"
+    local failed=false
+    local not_up
+    not_up=$(echo "$ps_output" | grep -v '^$' | grep -v $'\tUp' || true)
+    if [[ -n "$not_up" ]]; then
+        echo "ERROR: Containers not in Up state:" >&2
+        echo "$not_up" >&2
+        failed=true
+    fi
+    local unhealthy
+    unhealthy=$(echo "$ps_output" | grep '(unhealthy)' || true)
+    if [[ -n "$unhealthy" ]]; then
+        echo "ERROR: Unhealthy containers:" >&2
+        echo "$unhealthy" >&2
+        failed=true
+    fi
+    if [[ "$TARGET" == "control-plane" ]]; then
+        for cp_svc in supervisor observer executor operator-ui; do
+            if ! echo "$ps_output" | grep -q "$cp_svc"; then
+                echo "ERROR: control-plane component absent from docker ps: ${cp_svc}" >&2
+                failed=true
+            fi
+        done
+    fi
+    if [[ "$failed" == "true" ]]; then
+        echo "" >&2
+        echo "Full docker ps output above." >&2
+        exit 4
+    fi
+    echo "[ok] all containers healthy"
+}
+# ── REPORT ───────────────────────────────────────────────────────────────────
+report() {
+    local mode="${1:-deploy}"
+    local end_time
+    end_time=$(date +%s)
+    local elapsed
+    elapsed=$(( end_time - START_TIME ))
+    local commit_hash
+    commit_hash=$(git -C "$REPO_ROOT" rev-parse --short HEAD)
+    local gate_s verify_s
+    if [[ "$NO_GATE" == "true" ]]; then
+        gate_s="skip"
+    else
+        gate_s="ok"
+    fi
+    if [[ "$mode" == "dry-run" ]]; then
+        verify_s="skip(dry-run)"
+    else
+        verify_s="green"
+    fi
+    echo ""
+    if [[ "$mode" == "dry-run" ]]; then
+        echo "DRY RUN OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+    else
+        echo "DEPLOY OK | target=${TARGET} | commit=${commit_hash} | gate=${gate_s} | verify=${verify_s} | ${elapsed}s"
+    fi
+}
+# ── MAIN ─────────────────────────────────────────────────────────────────────
+preflight
+gate
+if [[ "$DRY_RUN" == "true" ]]; then
+    report dry-run
+    exit 0
 fi
-if [[ $EXIT_STATUS -eq 0 ]]; then
-    print_summary "$TARGET_HOST" "SUCCESS"
-    log "INFO" "--- Homelab Deployment Finished Successfully ---"
-else
-    print_summary "$TARGET_HOST" "FAILED"
-    log "ERROR" "--- Homelab Deployment Failed ---"
-    exit 1
-fi
+execute
+verify
+report
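
The checks in the new verify stage (not Up, unhealthy, missing control-plane component) can be sketched as a pure function over the `docker ps` text. This is an illustration only: `container_problems` is a hypothetical name, the container names in the sample are invented, and the input is assumed to be the tab-separated `{{.Names}}\t{{.Status}}` format used by the script.

```python
def container_problems(ps_output, required=()):
    """Classify `docker ps --format '{{.Names}}\\t{{.Status}}'` output.

    Returns a list of problem strings; an empty list means the checks pass.
    """
    lines = [line for line in ps_output.splitlines() if line.strip()]
    problems = []
    for line in lines:
        name, _, status = line.partition("\t")
        if not status.startswith("Up"):
            problems.append(f"not up: {name}")
        if "(unhealthy)" in status:
            problems.append(f"unhealthy: {name}")
    # Like the grep in the script, a component counts as present when its
    # name appears anywhere in the output.
    for component in required:
        if not any(component in line for line in lines):
            problems.append(f"missing: {component}")
    return problems


sample = (
    "control-plane-supervisor-1\tUp 2 hours (healthy)\n"
    "control-plane-observer-1\tUp 5 minutes (unhealthy)\n"
    "control-plane-executor-1\tRestarting (1) 3 seconds ago\n"
)
for problem in container_problems(sample, ("supervisor", "observer", "executor", "operator-ui")):
    print(problem)
```

Returning the full list instead of stopping at the first failure matches the script, which reports every problem before exiting with code 4.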


@ -1,15 +1,30 @@
 #!/usr/bin/env bash
 # orchestrate-deploy.sh - To be run on SATURN
-# Triggers deployment on remote execution nodes.
+# Triggers deployment on remote execution nodes via inventory.
 set -e
-HOSTS=("solaria" "piha" "vps")
-USER="oskar" # Default user
+REPO_PATH="${HOME}/homelab-codex-ws"
+USER="oskar"
-for HOST in "${HOSTS[@]}"; do
+while IFS=' ' read -r HOST TAG; do
     echo ">>> Triggering deployment on ${HOST}..."
+    if [[ "$TAG" == "lte" ]]; then
+        ssh -o ConnectTimeout=30 "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh" || \
+            echo "WARNING: Deployment on ${HOST} failed or timed out (LTE/intermittent node, skipping)"
+    else
     ssh "${USER}@${HOST}" "bash ~/homelab-codex-ws/scripts/deploy/deploy-node.sh"
-done
+    fi
+done < <(python3 -c "
+import yaml, sys
+with open('${REPO_PATH}/inventory/topology.yaml') as f:
+    data = yaml.safe_load(f)
+skip = {'saturn', 'solaria'}
+for name, info in (data.get('nodes') or {}).items():
+    if name in skip:
+        continue
+    uplink = ((info or {}).get('connectivity') or {}).get('uplink', '')
+    print(name, 'lte' if uplink == 'lte' else 'standard')
+")
 echo ">>> All deployments triggered."
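
The inventory-driven host selection above can be sketched on its own, working on an already-parsed topology so it needs no YAML library. `deploy_targets` and `ssh_argv` are hypothetical names; the skip set, the `lte` tag and the 30-second timeout come from the script, while the `-n` flag is an addition assumed here, not something the script passes.

```python
def deploy_targets(topology, skip=("saturn", "solaria")):
    """Return (host, tag) pairs, tagging nodes whose uplink is LTE."""
    targets = []
    for name, info in (topology.get("nodes") or {}).items():
        if name in skip:
            continue
        uplink = ((info or {}).get("connectivity") or {}).get("uplink", "")
        targets.append((name, "lte" if uplink == "lte" else "standard"))
    return targets


def ssh_argv(user, host, tag, command):
    """Build the ssh argument list; LTE nodes get a longer connect timeout."""
    argv = ["ssh", "-n"]  # assumption: -n keeps ssh from reading the caller's stdin
    if tag == "lte":
        argv += ["-o", "ConnectTimeout=30"]
    return argv + [f"{user}@{host}", command]


topology = {
    "nodes": {
        "saturn": {},
        "piha": {"roles": ["infra"]},
        "solaria": None,
        "vps": {"roles": ["edge"]},
        "chelsty-infra": {"connectivity": {"uplink": "lte"}},
    }
}
for host, tag in deploy_targets(topology):
    print(" ".join(ssh_argv("oskar", host, tag, "true")))
```

Building the argument list in one place also makes it easy to keep ssh away from the loop's standard input, which matters when the host list itself is read from a pipe.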


@ -0,0 +1,68 @@
#!/usr/bin/env bash
# verify-agent-fleet.sh - Check the status of stability agents across the fleet
REDIS_CMD="docker exec agent-system-redis redis-cli --raw"
# Check if docker is available
if ! command -v docker &> /dev/null; then
echo "Error: docker command not found."
exit 1
fi
# Check if container is running
if ! docker ps --filter "name=agent-system-redis" --format "{{.Names}}" | grep -q "agent-system-redis"; then
echo "Error: agent-system-redis container not found or not running."
echo "This script must be run on PIHA (the node hosting the Redis container)."
exit 1
fi
REQUIRED_NODES=("piha" "chelsty" "solaria" "vps")
MISSING_NODES=0
echo "--- Homelab Agent Fleet Status ---"
printf "%-10s %-15s %-10s %-10s %-30s\n" "NODE" "HOSTNAME" "HEALTH" "STATUS" "LAST_SEEN"
printf "%s\n" "--------------------------------------------------------------------------------"
for NODE in "${REQUIRED_NODES[@]}"; do
KEY="homelab:nodes:$NODE"
# Check if key exists
EXISTS=$($REDIS_CMD EXISTS "$KEY" 2>/dev/null | tr -d '\r\n')
if [[ "$EXISTS" != "1" ]]; then
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "MISSING" "N/A" "N/A" "N/A"
MISSING_NODES=$((MISSING_NODES + 1))
continue
fi
HOSTNAME=$($REDIS_CMD HGET "$KEY" hostname 2>/dev/null | tr -d '\r\n')
HEALTH=$($REDIS_CMD HGET "$KEY" health 2>/dev/null | tr -d '\r\n')
STATUS=$($REDIS_CMD HGET "$KEY" status 2>/dev/null | tr -d '\r\n')
LAST_SEEN=$($REDIS_CMD HGET "$KEY" last_seen 2>/dev/null | tr -d '\r\n')
printf "%-10s %-15s %-10s %-10s %-30s\n" "$NODE" "$HOSTNAME" "$HEALTH" "$STATUS" "$LAST_SEEN"
done
echo ""
echo "--- Control Plane Summary ---"
if command -v jq >/dev/null; then
curl -s http://127.0.0.1:18180/summary | jq .
else
curl -s http://127.0.0.1:18180/summary
fi
echo ""
echo "--- Control Plane Nodes ---"
if command -v jq >/dev/null; then
curl -s http://127.0.0.1:18180/nodes | jq .
else
curl -s http://127.0.0.1:18180/nodes
fi
if [[ $MISSING_NODES -gt 0 ]]; then
echo ""
echo "Error: $MISSING_NODES required nodes are missing from Redis."
exit 1
fi
exit 0
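
The per-node table logic of the fleet check can be sketched without a live Redis by passing in the hashes as a plain dict. `fleet_rows` is a hypothetical name; the key prefix, field names and required node list are taken from the script above.

```python
REQUIRED_NODES = ("piha", "chelsty", "solaria", "vps")
FIELDS = ("hostname", "health", "status", "last_seen")


def fleet_rows(hashes, required=REQUIRED_NODES):
    """Build table rows from {redis_key: {field: value}} and count missing nodes."""
    rows, missing = [], 0
    for node in required:
        record = hashes.get(f"homelab:nodes:{node}")
        if not record:
            rows.append((node, "MISSING", "N/A", "N/A", "N/A"))
            missing += 1
            continue
        rows.append((node,) + tuple(record.get(field, "") for field in FIELDS))
    return rows, missing


# Sample values are invented for illustration.
rows, missing = fleet_rows({
    "homelab:nodes:piha": {"hostname": "piha", "health": "ok", "status": "online",
                           "last_seen": "2026-06-03T16:20:00Z"},
})
for row in rows:
    print("%-10s %-15s %-10s %-10s %-30s" % row)
print("missing:", missing)
```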

scripts/dev/agent.sh Executable file

@ -0,0 +1,361 @@
#!/usr/bin/env bash
# Multi-agent worktree manager.
# EXIT: 0 ok, 1 preflight, 2 operation failed.
set -euo pipefail
trap 'echo "agent.sh: failed at line $LINENO (exit $?)" >&2' ERR
RESERVED_NAMES=(master main HEAD list merge clean new)
MAX_WORKTREES=4
# die MESSAGE [EXIT_CODE]: print only the message, so an explicit exit code is not echoed as text.
die() { echo "ERROR: $1" >&2; exit "${2:-2}"; }
prefail(){ echo "PREFLIGHT: $*" >&2; exit 1; }
# ── helpers ──────────────────────────────────────────────────────────────────
is_main_checkout() {
local git_dir common_dir
git_dir=$(git rev-parse --git-dir 2>/dev/null) || return 1
common_dir=$(git rev-parse --git-common-dir 2>/dev/null) || return 1
[ "$git_dir" = "$common_dir" ]
}
require_main_checkout() {
is_main_checkout || prefail "must run from the main checkout, not a worktree"
}
require_master_branch() {
local branch
branch=$(git rev-parse --abbrev-ref HEAD)
[ "$branch" = "master" ] || prefail "must be on master (currently on '$branch')"
}
require_clean_tree() {
local dirty
dirty=$(git status --porcelain)
[ -z "$dirty" ] || prefail "working tree is not clean — stash or commit first"
}
worktree_paths() {
# list worktree paths (excluding main); || true prevents grep exit-1 when empty
local main_path
main_path=$(git rev-parse --show-toplevel)
git worktree list --porcelain \
| awk '/^worktree /{p=$2} /^$/{print p}' \
| grep -v "^${main_path}$" \
|| true
}
worktree_count() {
worktree_paths | wc -l
}
branch_exists_local() { git show-ref --verify --quiet "refs/heads/$1"; }
branch_exists_remote() { git ls-remote --exit-code origin "$1" >/dev/null 2>&1; }
utc_now() { date -u +"%Y-%m-%dT%H:%M:%SZ"; }
age_str() {
local created_utc="$1"
local now_ts created_ts diff_s
now_ts=$(date -u +%s)
# replace T with a space for GNU `date -d`; the trailing Z is left in place
created_ts=$(date -u -d "${created_utc//T/ }" +%s 2>/dev/null) || { echo "?"; return; }
diff_s=$(( now_ts - created_ts ))
if (( diff_s < 60 )); then echo "${diff_s}s"
elif (( diff_s < 3600 )); then echo "$(( diff_s/60 ))m"
elif (( diff_s < 86400 )); then echo "$(( diff_s/3600 ))h"
else echo "$(( diff_s/86400 ))d"
fi
}
validate_name() {
local name="$1"
if ! [[ "$name" =~ ^[a-z][a-z0-9-]*$ ]]; then
prefail "name '$name' must match ^[a-z][a-z0-9-]*$"
fi
for r in "${RESERVED_NAMES[@]}"; do
if [ "$name" = "$r" ]; then
prefail "'$name' is a reserved word"
fi
done
}
# ── subcommands ───────────────────────────────────────────────────────────────
cmd_new() {
local name="${1:-}"
[ -n "$name" ] || { usage; exit 1; }
validate_name "$name"
require_main_checkout
require_master_branch
require_clean_tree
# worktree limit
local count
count=$(worktree_count)
if (( count >= MAX_WORKTREES )); then
echo "ERROR: already at maximum of $MAX_WORKTREES active worktrees:" >&2
cmd_list
exit 1
fi
# branch collision
if branch_exists_local "task/$name"; then
prefail "branch task/$name already exists locally"
fi
git fetch origin master --quiet
if branch_exists_remote "refs/heads/task/$name"; then
prefail "branch task/$name already exists on origin"
fi
# directory collision
local main_path wt_path
main_path=$(git rev-parse --show-toplevel)
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
[ ! -e "$wt_path" ] || prefail "directory $wt_path already exists"
# create worktree
git worktree add -b "task/$name" "$wt_path" origin/master \
|| die "git worktree add failed"
# write marker
local parent_commit
parent_commit=$(git rev-parse origin/master)
cat > "$wt_path/.agent-task" <<EOF
task: $name
branch: task/$name
parent_commit: $parent_commit
created_utc: $(utc_now)
worktree_path: $wt_path
EOF
echo ""
echo "Worktree created: $wt_path"
echo "Branch: task/$name"
echo ""
echo "── Start Claude Code in this worktree ──────────────────────────────────────"
echo "cd \"${wt_path}\" && claude --dangerously-skip-permissions \"You are in the worktree for task '${name}' (branch task/${name}). FIRST read .agent-task and .claude/skills/worktree-aware/SKILL.md, and only then start working. Commit only to your own branch; do not push to origin master.\""
echo "─────────────────────────────────────────────────────────────────────────────"
}
cmd_list() {
local main_path
main_path=$(git rev-parse --show-toplevel)
# fetch to get up-to-date ahead/behind
git fetch origin master --quiet 2>/dev/null || true
local paths
paths=$(worktree_paths)
if [ -z "$paths" ]; then
echo "(no active task worktrees)"
return
fi
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
"NAME" "BRANCH" "CREATED" "AGE" "STATUS" "A/B" "PARENT"
while IFS= read -r wt_path; do
[ -z "$wt_path" ] && continue
local marker="$wt_path/.agent-task"
local task_name branch parent_commit created_utc
if [ -f "$marker" ]; then
task_name=$( grep '^task:' "$marker" | awk '{print $2}')
branch=$( grep '^branch:' "$marker" | awk '{print $2}')
parent_commit=$(grep '^parent_commit:' "$marker" | awk '{print $2}')
created_utc=$(grep '^created_utc:' "$marker" | awk '{print $2}')
else
task_name="(no marker)"
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "?")
parent_commit="?"
created_utc=""
fi
local status="clean"
local dirty
dirty=$(git -C "$wt_path" status --porcelain 2>/dev/null || echo "?")
[ -n "$dirty" ] && status="dirty"
local ahead behind ab
ahead=$(git -C "$wt_path" rev-list --count "origin/master..${branch}" 2>/dev/null || echo "?")
behind=$(git -C "$wt_path" rev-list --count "${branch}..origin/master" 2>/dev/null || echo "?")
ab="+${ahead}/-${behind}"
local age=""
[ -n "$created_utc" ] && age=$(age_str "$created_utc")
local short_parent="${parent_commit:0:7}"
local short_created="${created_utc:0:10}"
printf "%-20s %-25s %-10s %-8s %-8s %-7s %s\n" \
"$task_name" "$branch" "$short_created" "$age" "$status" "$ab" "$short_parent"
done <<< "$paths"
}
cmd_merge() {
local name="${1:-}"
[ -n "$name" ] || { usage; exit 1; }
require_main_checkout
require_master_branch
require_clean_tree
git fetch origin --quiet
branch_exists_local "task/$name" || die "branch task/$name not found locally" 1
local main_path wt_path
main_path=$(git rev-parse --show-toplevel)
wt_path="$(dirname "$main_path")/homelab-codex-ws-${name}"
# attempt ff-only merge
local merge_failed=0
git merge --ff-only "task/$name" || merge_failed=1
if (( merge_failed )); then
# abort any partial merge state
git merge --abort 2>/dev/null || true
echo ""
echo "ERROR: task/$name cannot be fast-forwarded into master." >&2
echo " The branch has likely diverged from master." >&2
echo "" >&2
echo "Diagnose with:" >&2
echo " git log master..task/$name # commits only on task branch" >&2
echo " git log task/$name..master # commits master has that task doesn't" >&2
echo "" >&2
echo "Then decide: rebase task/$name onto master, or merge manually." >&2
echo "Worktree and branch are preserved — no changes made." >&2
exit 2
fi
echo "Merged task/$name into master (fast-forward)."
git push origin master || die "git push origin master failed"
echo "Pushed master to origin."
if [ -d "$wt_path" ]; then
git worktree remove "$wt_path" || die "git worktree remove $wt_path failed"
echo "Removed worktree: $wt_path"
else
echo "(worktree directory $wt_path not found — skipping worktree remove)"
fi
git branch -d "task/$name" || die "git branch -d task/$name failed"
echo "Deleted local branch task/$name."
git push origin --delete "task/$name" 2>/dev/null \
&& echo "Deleted remote branch task/$name." \
|| echo "(remote branch task/$name not found — nothing to delete)"
echo ""
echo "Done. task/$name merged and cleaned up."
}
cmd_clean() {
local main_path
main_path=$(git rev-parse --show-toplevel)
git fetch origin --quiet 2>/dev/null || true
local to_remove=()
# orphaned registered worktrees: branch deleted or fully merged into master
local paths
paths=$(worktree_paths)
while IFS= read -r wt_path; do
[ -z "$wt_path" ] && continue
local branch
branch=$(git -C "$wt_path" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "")
[ -z "$branch" ] && { to_remove+=("worktree:$wt_path (unreadable branch)"); continue; }
# branch gone locally?
if ! branch_exists_local "$branch"; then
to_remove+=("worktree:$wt_path (branch $branch no longer exists)")
continue
fi
# branch fully merged into master?
local ahead
ahead=$(git rev-list --count "origin/master..${branch}" 2>/dev/null || echo "1")
if [ "$ahead" = "0" ]; then
to_remove+=("worktree:$wt_path (branch $branch fully merged into origin/master)")
fi
done <<< "$paths"
# dangling directories: ../homelab-codex-ws-* not registered
local registered_paths
registered_paths=$(git worktree list --porcelain | awk '/^worktree /{print $2}')
local parent_dir
parent_dir=$(dirname "$main_path")
while IFS= read -r candidate; do
[ -d "$candidate" ] || continue
# -x matches whole lines only, so ws-a is not treated as registered because ws-ab is.
if ! echo "$registered_paths" | grep -qxF "$candidate"; then
to_remove+=("dangling:$candidate")
fi
done < <(find "$parent_dir" -maxdepth 1 -name "homelab-codex-ws-*" -type d 2>/dev/null)
if [ ${#to_remove[@]} -eq 0 ]; then
echo "Nothing to clean."
return 0
fi
echo "Found ${#to_remove[@]} item(s) to clean:"
for entry in "${to_remove[@]}"; do
echo " $entry"
done
echo ""
local overall_rc=0
for entry in "${to_remove[@]}"; do
local kind="${entry%%:*}"
local path="${entry#*:}"
# strip trailing annotation in parens
local raw_path
raw_path="${path%% (*}"
local confirm
read -r -p "Remove $kind '$raw_path'? [y/N] " confirm
if [[ "$confirm" =~ ^[Yy]$ ]]; then
if [ "$kind" = "worktree" ]; then
git worktree remove --force "$raw_path" 2>/dev/null \
|| { echo " WARNING: git worktree remove failed, trying rm -rf"; rm -rf "$raw_path" || true; }
else
rm -rf "$raw_path"
fi
echo " Removed."
else
echo " Skipped."
fi
done
return $overall_rc
}
usage() {
cat <<'EOF'
Usage: agent.sh <subcommand> [args]
agent.sh new <name> Create a new task worktree (branch task/<name>)
agent.sh list List active task worktrees with status
agent.sh merge <name> Fast-forward merge task/<name> into master and clean up
agent.sh clean Remove orphaned or dangling worktrees (interactive)
EXIT: 0 ok, 1 preflight, 2 operation failed.
EOF
}
# ── dispatch ──────────────────────────────────────────────────────────────────
SUBCOMMAND="${1:-}"
shift || true
case "$SUBCOMMAND" in
new) cmd_new "$@" ;;
list) cmd_list "$@" ;;
merge) cmd_merge "$@" ;;
clean) cmd_clean "$@" ;;
*) usage; exit 1 ;;
esac
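
Note on `worktree_paths`: the awk filter keeps only the second whitespace-separated field of each `worktree` line, so a path containing a space would be cut short. The sketch below is not part of this change set. It only illustrates, in Python, one way the same `git worktree list --porcelain` text could be parsed while keeping the whole path. The function name `parse_worktree_paths` and the sample text are hypothetical.

```python
from typing import List


def parse_worktree_paths(porcelain: str, main_path: str) -> List[str]:
    """Return worktree paths from porcelain output, excluding the main checkout.

    Assumes each record starts with a line of the form 'worktree <path>'.
    """
    prefix = "worktree "
    paths = []
    for line in porcelain.splitlines():
        if line.startswith(prefix):
            path = line[len(prefix):]  # keep everything after the prefix, spaces included
            if path != main_path:
                paths.append(path)
    return paths


# Made-up sample in the shape of `git worktree list --porcelain` output.
SAMPLE = (
    "worktree /home/u/homelab-codex-ws\n"
    "HEAD 1111111111111111111111111111111111111111\n"
    "branch refs/heads/master\n"
    "\n"
    "worktree /home/u/homelab-codex-ws-fix disk\n"
    "HEAD 2222222222222222222222222222222222222222\n"
    "branch refs/heads/task/fix-disk\n"
    "\n"
)

print(parse_worktree_paths(SAMPLE, "/home/u/homelab-codex-ws"))
```

Filtering on the exact main path mirrors the `grep -v "^${main_path}$"` step in the script.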

scripts/monitor/health-monitor.sh Executable file

@ -0,0 +1,338 @@
#!/usr/bin/env bash
# health-monitor.sh - Homelab node health monitor and safe disk cleanup
#
# Designed to run standalone on the host (cron or direct) or to be called by
# the node-agent Python daemon. All cleanup decisions follow the conservative
# policy agreed in the design review:
#
# lte_node (chelsty-infra, chelsty-ha) : NO cleanup at all
# sd_card (piha, saturn) : dangling images + stopped containers,
# rate-limited to once per 24 h
# ai_node (solaria) : dangling images + stopped containers
# + build cache (NEVER -a)
# standard (vps) : dangling images + stopped containers
# + build cache
#
# VPS additionally rotates control-plane filesystem artefacts:
# actions/completed + failed > 7 days
# logs/deploy > 30 days
# events/** > 3 days AND past observer checkpoint
#
# NEVER TOUCHED (any node): /opt/homelab/data/, config/, state/,
# actions/pending|approved|running, Frigate recordings, Ollama models,
# Zigbee2MQTT data, Mosquitto data, HA database/config.
set -euo pipefail
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
RUNTIME_PATH="${RUNTIME_PATH:-/opt/homelab}"
EVENTS_DIR="${RUNTIME_PATH}/events"
STATE_DIR="${RUNTIME_PATH}/state"
LOGS_DIR="${RUNTIME_PATH}/logs"
ACTIONS_DIR="${RUNTIME_PATH}/actions"
NODE_NAME="${NODE_NAME:-$(hostname)}"
TIMESTAMP=$(date +%s)
DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
# Thresholds
DISK_WARN_PCT=75
DISK_CRIT_PCT=85
MEM_WARN_PCT=85
MEM_CRIT_PCT=95
# Rate-limit file for SD-card nodes (max one Docker cleanup per 24 h)
CLEANUP_LOCK="${STATE_DIR}/last-docker-cleanup"
CLEANUP_INTERVAL=86400 # seconds
# Node classifications
LTE_NODES="chelsty-infra chelsty-ha"
SD_CARD_NODES="piha saturn"
AI_NODES="solaria"
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
log() { echo "$(date -u +%H:%M:%S) [INFO] $*"; }
warn() { echo "$(date -u +%H:%M:%S) [WARN] $*" >&2; }
err() { echo "$(date -u +%H:%M:%S) [ERROR] $*" >&2; }
contains() {
local word="$1"; shift
for w in "$@"; do [[ "$w" == "$word" ]] && return 0; done
return 1
}
get_node_type() {
# shellcheck disable=SC2086
if contains "$NODE_NAME" $LTE_NODES; then echo "lte_node"; return; fi
if contains "$NODE_NAME" $SD_CARD_NODES; then echo "sd_card"; return; fi
if contains "$NODE_NAME" $AI_NODES; then echo "ai_node"; return; fi
echo "standard"
}
# ---------------------------------------------------------------------------
# Event emission
# ---------------------------------------------------------------------------
emit_event() {
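local type="$1" severity="$2" service="${3:-}" message="$4"
# The default is held in a variable: "${5:-{}}" ends the expansion at the first '}'
# and would append a stray '}' to every caller-supplied payload, producing invalid JSON.
local empty_payload='{}'
local payload="${5:-$empty_payload}"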
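# Include the service (when set) so several same-type events emitted in one run,
# e.g. one per container, do not overwrite each other's file.
local id="evt-${NODE_NAME}-${TIMESTAMP}-${type}${service:+-${service}}"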
local dir="${EVENTS_DIR}/${NODE_NAME}"
mkdir -p "$dir"
cat > "${dir}/${id}.json" <<EOF
{
"id": "${id}",
"timestamp": ${TIMESTAMP},
"date": "${DATE}",
"type": "${type}",
"severity": "${severity}",
"node": "${NODE_NAME}",
"service": "${service}",
"message": "${message}",
"payload": ${payload}
}
EOF
}
# ---------------------------------------------------------------------------
# Health checks
# ---------------------------------------------------------------------------
check_disk() {
# Use /opt/homelab as the check target — it lives on the host filesystem
# and this path is correct both when running natively and in a container
# that mounts /opt/homelab from the host.
local mount="${RUNTIME_PATH}"
local usage_pct avail_mb total_mb
usage_pct=$(df "${mount}" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}') || return
avail_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $4/1024}') || return
total_mb=$(df "${mount}" 2>/dev/null | awk 'NR==2 {printf "%d", $2/1024}') || return
if [[ "${usage_pct}" -ge "${DISK_CRIT_PCT}" ]]; then
warn "Disk CRITICAL: ${usage_pct}% used (${avail_mb} MB free)"
emit_event "disk_pressure" "high" "" \
"Disk usage critical: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
elif [[ "${usage_pct}" -ge "${DISK_WARN_PCT}" ]]; then
warn "Disk elevated: ${usage_pct}% used"
emit_event "disk_pressure" "medium" "" \
"Disk usage elevated: ${usage_pct}% on ${mount} (${avail_mb} MB free)" \
"{\"usage_pct\": ${usage_pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": ${total_mb}, \"mount\": \"${mount}\"}"
fi
echo "${usage_pct}"
}
check_memory() {
local total avail pct avail_mb
total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
pct=$(( (total - avail) * 100 / total ))
avail_mb=$(( avail / 1024 ))
if [[ "${pct}" -ge "${MEM_CRIT_PCT}" ]]; then
warn "Memory CRITICAL: ${pct}% used"
emit_event "high_memory" "high" "" \
"Memory usage critical: ${pct}% (${avail_mb} MB available)" \
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
elif [[ "${pct}" -ge "${MEM_WARN_PCT}" ]]; then
warn "Memory elevated: ${pct}%"
emit_event "high_memory" "medium" "" \
"Memory usage elevated: ${pct}% (${avail_mb} MB available)" \
"{\"usage_pct\": ${pct}, \"avail_mb\": ${avail_mb}, \"total_mb\": $((total/1024))}"
fi
echo "${pct}"
}
check_cpu() {
# Two-sample /proc/stat delta for accurate instantaneous CPU usage.
local idle1 total1 idle2 total2 pct
read -r idle1 total1 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
sleep 1
read -r idle2 total2 < <(awk '/^cpu / {idle=$5; total=0; for(i=2;i<=NF;i++) total+=$i; print idle, total}' /proc/stat)
local d_idle=$(( idle2 - idle1 ))
local d_total=$(( total2 - total1 ))
pct=$(( d_total > 0 ? 100 - d_idle * 100 / d_total : 0 ))
if [[ "${pct}" -ge 90 ]]; then
warn "CPU elevated: ${pct}%"
emit_event "high_cpu" "medium" "" \
"CPU usage elevated: ${pct}%" \
"{\"usage_pct\": ${pct}}"
fi
echo "${pct}"
}
check_containers() {
command -v docker &>/dev/null || return
# Exited containers that belong to a Compose project. The Compose label is used as a
# proxy for "should be running"; the restart policy itself is not inspected.
local cname
while IFS= read -r cname; do
[[ -z "$cname" ]] && continue
warn "Container exited (should be running): ${cname}"
emit_event "containers_not_running" "high" "${cname}" \
"Container '${cname}' has exited unexpectedly (restart=unless-stopped)" \
"{\"container\": \"${cname}\"}"
done < <(docker ps -a \
--filter "status=exited" \
--filter "label=com.docker.compose.project" \
--format "{{.Names}}" 2>/dev/null || true)
# Containers that are running but their health check is failing
while IFS= read -r cname; do
[[ -z "$cname" ]] && continue
warn "Container unhealthy: ${cname}"
emit_event "healthcheck_failed" "high" "${cname}" \
"Container '${cname}' is running but health check is failing" \
"{\"container\": \"${cname}\"}"
done < <(docker ps \
--filter "health=unhealthy" \
--format "{{.Names}}" 2>/dev/null || true)
}
# ---------------------------------------------------------------------------
# Safe Docker cleanup (per policy)
# ---------------------------------------------------------------------------
_sd_card_rate_ok() {
if [[ -f "${CLEANUP_LOCK}" ]]; then
local last_ts elapsed
last_ts=$(cat "${CLEANUP_LOCK}" 2>/dev/null || echo 0)
elapsed=$(( TIMESTAMP - last_ts ))
if [[ "${elapsed}" -lt "${CLEANUP_INTERVAL}" ]]; then
log "Docker cleanup skipped: last run ${elapsed}s ago (limit ${CLEANUP_INTERVAL}s)"
return 1
fi
fi
return 0
}
_mark_cleanup_done() {
echo "${TIMESTAMP}" > "${CLEANUP_LOCK}"
}
run_safe_cleanup() {
command -v docker &>/dev/null || return
local node_type
node_type=$(get_node_type)
case "${node_type}" in
lte_node)
# NO cleanup on LTE nodes. Any docker operation risks triggering
# a pull over a metered/intermittent connection.
log "Skipping Docker cleanup: LTE node (${NODE_NAME})"
;;
sd_card)
# Dangling images + stopped containers only.
# Rate-limited to once per 24 hours to protect SD card write endurance.
_sd_card_rate_ok || return
log "Running rate-limited Docker cleanup (SD card node)"
docker image prune -f >/dev/null 2>&1 || true
docker container prune -f >/dev/null 2>&1 || true
_mark_cleanup_done
;;
ai_node)
# Dangling images + stopped containers + build cache.
# NEVER docker image prune -a (would remove Ollama runtime images,
# requiring a multi-hour re-pull of model weights).
log "Running AI-node Docker cleanup (dangling images + containers + build cache)"
docker image prune -f >/dev/null 2>&1 || true
docker container prune -f >/dev/null 2>&1 || true
docker builder prune -f >/dev/null 2>&1 || true
;;
standard)
# VPS and other standard nodes: full safe cleanup.
log "Running standard Docker cleanup"
docker image prune -f >/dev/null 2>&1 || true
docker container prune -f >/dev/null 2>&1 || true
docker builder prune -f >/dev/null 2>&1 || true
;;
esac
}
# ---------------------------------------------------------------------------
# VPS-specific: control-plane filesystem rotation
# ---------------------------------------------------------------------------
cleanup_control_plane_fs() {
log "Running control-plane filesystem rotation"
# Completed / failed actions older than 7 days
for status in completed failed; do
local dir="${ACTIONS_DIR}/${status}"
[[ -d "${dir}" ]] || continue
find "${dir}" -name "*.json" -mtime +7 -delete 2>/dev/null && \
log "Cleaned ${status} actions older than 7 days" || true
done
# Deploy logs older than 30 days
local deploy_logs="${LOGS_DIR}/deploy"
if [[ -d "${deploy_logs}" ]]; then
find "${deploy_logs}" -name "*.log" -mtime +30 -delete 2>/dev/null && \
log "Cleaned deploy logs older than 30 days" || true
fi
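# Event files older than 3 days AND already past the observer checkpoint.
# The dual condition ensures we never delete an event the observer hasn't seen.
# NOTE: this reads the legacy last_processed_file key. A checkpoint that only
# contains node_checkpoints yields an empty value here, so event cleanup is skipped.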
local checkpoint="${STATE_DIR}/observer_checkpoint.json"
if [[ -f "${checkpoint}" ]] && command -v python3 &>/dev/null; then
local last_processed
last_processed=$(python3 -c "
import json, sys
try:
d = json.load(open('${checkpoint}'))
print(d.get('last_processed_file', ''))
except Exception:
print('')
" 2>/dev/null || echo "")
if [[ -n "${last_processed}" ]]; then
find "${EVENTS_DIR}" -name "*.json" -mtime +3 | while IFS= read -r f; do
# Only delete files that sort before the checkpoint path
# (i.e., the observer has already processed them).
if [[ "$f" < "${last_processed}" ]]; then
rm -f "$f"
log "Cleaned old event: $(basename "$f")"
fi
done
else
log "No observer checkpoint set; skipping event file cleanup"
fi
fi
}
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
mkdir -p "${EVENTS_DIR}/${NODE_NAME}" "${STATE_DIR}"
log "Health check starting on ${NODE_NAME} (type=$(get_node_type))"
disk_pct=$(check_disk || echo 0)
mem_pct=$(check_memory || echo 0)
cpu_pct=$(check_cpu || echo 0)
check_containers
run_safe_cleanup
# VPS: also rotate control-plane filesystem artefacts
if [[ "${NODE_NAME}" == "vps" ]]; then
cleanup_control_plane_fs
fi
# Emit a node_health heartbeat so the observer can update node status
# and the supervisor can see up-to-date resource metrics.
emit_event "node_health" "info" "" \
"Health check completed on ${NODE_NAME}" \
"{\"disk_pct\": ${disk_pct}, \"mem_pct\": ${mem_pct}, \"cpu_pct\": ${cpu_pct}}"
log "Health check complete (disk=${disk_pct}% mem=${mem_pct}% cpu=${cpu_pct}%)"
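
The per-node cleanup policy from the header comment and `run_safe_cleanup` can be summarised as a small lookup. The sketch below is illustrative only and is not called by the script. The node lists are copied from the script, while `node_type` and `cleanup_commands` are hypothetical names. The 24-hour rate limit for SD-card nodes is deliberately not modelled.

```python
from typing import List

# Node classifications copied from health-monitor.sh.
LTE_NODES = {"chelsty-infra", "chelsty-ha"}
SD_CARD_NODES = {"piha", "saturn"}
AI_NODES = {"solaria"}

IMAGE_PRUNE = "docker image prune -f"          # dangling images only (no -a)
CONTAINER_PRUNE = "docker container prune -f"  # stopped containers
BUILDER_PRUNE = "docker builder prune -f"      # build cache


def node_type(node_name: str) -> str:
    """Classify a node the same way get_node_type does in the shell script."""
    if node_name in LTE_NODES:
        return "lte_node"
    if node_name in SD_CARD_NODES:
        return "sd_card"
    if node_name in AI_NODES:
        return "ai_node"
    return "standard"


def cleanup_commands(node_name: str) -> List[str]:
    """Return the prune commands the policy allows for this node."""
    kind = node_type(node_name)
    if kind == "lte_node":
        return []  # no cleanup at all on LTE nodes
    commands = [IMAGE_PRUNE, CONTAINER_PRUNE]
    if kind in ("ai_node", "standard"):
        commands.append(BUILDER_PRUNE)
    return commands


for name in ("chelsty-ha", "piha", "solaria", "vps"):
    print(name, node_type(name), cleanup_commands(name))
```

Keeping the policy as data makes it easy to check that no node type ever receives `docker image prune -a`.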


@ -7,6 +7,34 @@ import yaml
from datetime import datetime, timezone
from pathlib import Path
def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def _parse_ts(ts) -> float:
    """Return a Unix timestamp float from ts, which may be int/float or an ISO-8601 string.

    Events from node-agent use int(time.time()); events from stability-agent / events.py
    use ISO format ('2026-06-03T10:30:00Z'). Both appear in incident fields such as
    last_occurrence and resolved_at, so any arithmetic on them must go through here.
    Returns 0.0 on None or unparseable input so callers can use plain comparisons.
    """
    if ts is None:
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    try:
        return datetime.fromisoformat(str(ts).replace("Z", "+00:00")).timestamp()
    except Exception:
        return 0.0
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
@ -24,14 +52,17 @@ logger = logging.getLogger("observer")
class Observer:
def __init__(self):
- self.last_processed_file = None
# Per-node-directory checkpoint: {"vps": "last/file/path", "piha": "last/file/path"}
# Replaces the old single last_processed_file which silently skipped event dirs
# that sort alphabetically before the checkpoint (e.g. piha/ < vps/).
self.node_checkpoints: dict = {}
self.world_state = {
"nodes": {},
"services": {},
"deployments": {},
"incidents": {},
"summary": {
- "last_update": None,
"last_update": datetime.now(timezone.utc).isoformat(),
"status": "initializing",
"active_incidents_count": 0
}
@ -83,10 +114,21 @@ class Observer:
try:
with open(OBSERVER_STATE_FILE, "r") as f:
checkpoint = json.load(f)
- self.last_processed_file = checkpoint.get("last_processed_file")
- # We might want to persist partial world state,
- # but for now we rebuild from events (idempotent)
- # or we can load existing world state files.
if "node_checkpoints" in checkpoint:
# New format: per-directory checkpoints.
self.node_checkpoints = checkpoint["node_checkpoints"]
elif "last_processed_file" in checkpoint:
# Migrate old single-file checkpoint: extract node dir from path.
old = checkpoint["last_processed_file"]
if old:
try:
node_dir = Path(old).relative_to(EVENTS_DIR).parts[0]
self.node_checkpoints = {node_dir: old}
logger.info(f"Migrated old checkpoint → node_checkpoints: {self.node_checkpoints}")
except Exception:
pass # Bad path — start fresh
self._load_world_from_disk()
except Exception as e:
logger.error(f"Failed to load checkpoint: {e}")
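
The checkpoint migration in this hunk can be exercised in isolation. The sketch below is a standalone approximation, not the observer's code. `migrate_checkpoint` is a hypothetical name, the paths are examples, and `PurePosixPath` is used so the illustration behaves the same on any platform.

```python
from pathlib import PurePosixPath
from typing import Dict


def migrate_checkpoint(checkpoint: dict, events_dir: str) -> Dict[str, str]:
    """Return per-node checkpoints from either the new or the legacy format."""
    if "node_checkpoints" in checkpoint:
        # New format: already keyed by node directory.
        return dict(checkpoint["node_checkpoints"])
    old = checkpoint.get("last_processed_file")
    if not old:
        return {}
    try:
        # The first path component below events_dir is the node directory.
        node_dir = PurePosixPath(old).relative_to(PurePosixPath(events_dir)).parts[0]
    except (ValueError, IndexError):
        return {}  # path outside events_dir, or equal to it: start fresh
    return {node_dir: old}


legacy = {"last_processed_file": "/opt/homelab/events/vps/evt-vps-1-node_health.json"}
print(migrate_checkpoint(legacy, "/opt/homelab/events"))
print(migrate_checkpoint({"last_processed_file": "/tmp/elsewhere.json"}, "/opt/homelab/events"))
```

After migration only the node named in the old path has a checkpoint, so every other node directory is read from the beginning. That appears to be the intent, given the comment about directories that sort before the old checkpoint being skipped.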
@ -110,17 +152,128 @@ class Observer:
def _save_checkpoint(self):
try:
- with open(OBSERVER_STATE_FILE, "w") as f:
- json.dump({"last_processed_file": self.last_processed_file}, f)
_atomic_write_json(OBSERVER_STATE_FILE, {"node_checkpoints": self.node_checkpoints})
except Exception as e:
logger.error(f"Failed to save checkpoint: {e}")
def _prune_stale_world(self):
"""Remove world-state entries for nodes absent from the topology inventory.
Root cause this guards against: when NODE_NAME env var is unset, node_agent.py
falls back to socket.gethostname(), which inside a Docker container returns the
12-char hex container ID (e.g. 'be17cb6eb0f6') instead of the canonical host name
('vps'). The observer ingests those events and creates ghost entries that never
expire on their own.
Also ages out resolved incidents older than 7 days to keep world state lean.
"""
known_nodes = set(self.inventory["nodes"].keys())
if not known_nodes:
# Inventory failed to load — don't prune to avoid wiping valid state.
return
stale_nodes = [n for n in list(self.world_state["nodes"].keys())
if n not in known_nodes]
for n in stale_nodes:
logger.info(f"Pruning stale node from world state: {n}")
del self.world_state["nodes"][n]
stale_svcs = [k for k in list(self.world_state["services"].keys())
if k.split("/")[0] in stale_nodes]
for k in stale_svcs:
logger.info(f"Pruning stale service from world state: {k}")
del self.world_state["services"][k]
# Prune ghost service keys whose service-name portion is a hash-prefixed
# Docker stale-state artifact (e.g. "9e36297651e7_control-plane-observer").
# These are created when node-agent incorrectly uses c.name instead of the
# compose label, and accumulate on every container rebuild.
# Pattern: <node>/<12hexchars>_<real-name>
ghost_svcs = [
k for k in list(self.world_state["services"].keys())
if len(k.split("/", 1)) == 2
and len(k.split("/", 1)[1]) > 13
and k.split("/", 1)[1][12] == "_"
and all(ch in "0123456789abcdef" for ch in k.split("/", 1)[1][:12])
]
for k in ghost_svcs:
logger.info(f"Pruning ghost (hash-prefixed) service key from world state: {k}")
del self.world_state["services"][k]
now = time.time()
try:
# Collect incident_ids currently referenced by any service entry.
linked_ids: set = {
svc.get("incident_id")
for svc in self.world_state["services"].values()
if svc.get("incident_id")
}
# Case 1 — service is healthy but still points at an active incident.
# process_event already calls _resolve_incident on service_healthy events,
# but if the observer restarted with on-disk state where the link was
# intact (inconsistency from a pre-atomic-write crash), it may not get
# resolved until the next service_healthy event is processed. Resolve
# immediately — a healthy service cannot have an ongoing incident.
for svc_key, svc in self.world_state["services"].items():
if svc.get("status") != "healthy":
continue
inc_id = svc.get("incident_id")
if not inc_id:
continue
inc = self.world_state["incidents"].get(inc_id, {})
if inc.get("status") == "active":
logger.info(
f"Auto-resolving incident {inc_id} for {svc_key}: "
f"service is healthy"
)
inc["status"] = "resolved"
inc["resolved_at"] = now
svc["incident_id"] = None
linked_ids.discard(inc_id)
# Case 2 — orphaned active incident: no service entry links to it and
# last_occurrence is older than 5 minutes (guard against creation races).
# These are the stale records left behind when on-disk state was
# inconsistent: the service entry had incident_id cleared but incidents.json
# still had the record as "active".
for inc_id, inc in self.world_state["incidents"].items():
if inc.get("status") != "active":
continue
if inc_id in linked_ids:
continue
age = now - _parse_ts(inc.get("last_occurrence"))
if age > 300: # 5-minute guard
logger.info(
f"Auto-resolving orphaned incident {inc_id} "
f"(service={inc.get('service')}, node={inc.get('node')}): "
f"no service references it, age={int(age)}s"
)
inc["status"] = "resolved"
inc["resolved_at"] = now
except Exception as exc:
logger.error(f"Error during incident auto-resolve in _prune_stale_world: {exc}")
# Remove resolved incidents older than 7 days.
# Use _parse_ts so ISO-string resolved_at values are handled correctly.
stale_incidents = [
k for k, v in self.world_state["incidents"].items()
if v.get("status") == "resolved"
and now - _parse_ts(v.get("resolved_at")) > 7 * 86400
]
for k in stale_incidents:
del self.world_state["incidents"][k]
def _save_world(self):
self.world_state["summary"]["last_update"] = datetime.now(timezone.utc).isoformat()
active_incidents = [
k for k, v in self.world_state["incidents"].items() if v.get("status") == "active"
]
self.world_state["summary"]["active_incidents_count"] = len(active_incidents)
self.world_state["summary"]["node_count"] = len(self.world_state["nodes"])
self.world_state["summary"]["service_count"] = len(self.world_state["services"])
if active_incidents:
self.world_state["summary"]["status"] = "degraded"
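
The ghost-key test in `_prune_stale_world` (12 lowercase hex characters, an underscore, then a non-empty real name) can also be written as a regular expression. This is a sketch of an equivalent check under that reading of the list comprehension. `is_ghost_service_key` is a hypothetical name and is not part of the observer.

```python
import re

# 12 lowercase hex chars, "_", then at least one more character.
_GHOST_NAME = re.compile(r"[0-9a-f]{12}_.+", re.DOTALL)


def is_ghost_service_key(key: str) -> bool:
    """True for keys shaped like '<node>/<12hexchars>_<real-name>'."""
    node, sep, name = key.partition("/")  # split on the first "/" only
    return bool(sep) and _GHOST_NAME.fullmatch(name) is not None


for key in (
    "vps/9e36297651e7_control-plane-observer",  # example from the code comment
    "vps/control-plane-observer",
    "vps/9e36297651e7_",
    "9e36297651e7_control-plane-observer",
):
    print(key, is_ghost_service_key(key))
```

A legitimately named service that happens to begin with 12 hex characters and an underscore would also match, which is the same trade-off the original condition makes.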
@ -132,12 +285,12 @@ class Observer:
"services.json": self.world_state["services"],
"deployments.json": self.world_state["deployments"],
"incidents.json": self.world_state["incidents"],
"recommendations.json": [],
"runtime-summary.json": self.world_state["summary"]
}
for filename, data in files.items():
try:
- with open(WORLD_DIR / filename, "w") as f:
- json.dump(data, f, indent=2)
_atomic_write_json(WORLD_DIR / filename, data)
except Exception as e:
logger.error(f"Failed to save {filename}: {e}")
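
The write-then-rename pattern behind `_atomic_write_json` can be tried out on a temporary directory. The sketch below copies the helper's logic under a different name (`atomic_write_json`). The `demo` function and the sample data are illustrative assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, data) -> None:
    """Write to a sibling .tmp file, fsync it, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # push file contents to disk before the rename
    os.replace(tmp, path)     # readers see the old file or the new one, never a partial one


def demo() -> dict:
    """Round-trip a small document and report whether the temp file was left behind."""
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "nodes.json"
        atomic_write_json(target, {"vps": {"status": "online"}})
        leftover_tmp = (Path(d) / "nodes.tmp").exists()
        with open(target) as f:
            return {"data": json.load(f), "leftover_tmp": leftover_tmp}


print(demo())
```

Because `with_suffix(".tmp")` replaces the existing suffix, two targets that differ only in extension would share one temporary file. The world-state filenames in this hunk all have distinct stems, so that does not arise here.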
@ -164,6 +317,35 @@ class Observer:
elif etype == "node_offline":
self.world_state["nodes"][node]["status"] = "offline"
elif etype == "node_health":
# Regular heartbeat from node-agent; updates resource metrics.
# Clears disk_pressure if disk is now healthy (< warn threshold).
self.world_state["nodes"][node]["status"] = "online"
self.world_state["nodes"][node].update({
"disk_usage_pct": payload.get("disk_pct"),
"mem_usage_pct": payload.get("mem_pct"),
"cpu_usage_pct": payload.get("cpu_pct"),
})
if (payload.get("disk_pct") or 0) < 75:
self.world_state["nodes"][node].pop("disk_pressure", None)
elif etype == "disk_pressure":
# Emitted when disk usage crosses 75 % (medium) or 85 % (high).
# The supervisor reads disk_pressure to generate disk_cleanup actions.
self.world_state["nodes"][node]["disk_pressure"] = severity
self.world_state["nodes"][node]["disk_usage_pct"] = payload.get("usage_pct")
elif etype == "high_memory":
# Memory pressure observation; recorded on the node for correlation.
# No automated action — operator decides if a container restart helps.
self.world_state["nodes"][node]["memory_pressure"] = severity
self.world_state["nodes"][node]["mem_usage_pct"] = payload.get("usage_pct")
elif etype == "high_cpu":
# CPU pressure observation; recorded for visibility.
self.world_state["nodes"][node]["cpu_pressure"] = severity
self.world_state["nodes"][node]["cpu_usage_pct"] = payload.get("usage_pct")
# 2. Update Service State # 2. Update Service State
if service and service != "all": if service and service != "all":
svc_key = f"{node}/{service}" svc_key = f"{node}/{service}"
@ -180,6 +362,15 @@ class Observer:
if etype == "service_recovered": if etype == "service_recovered":
self.world_state["services"][svc_key]["status"] = "healthy" self.world_state["services"][svc_key]["status"] = "healthy"
self._resolve_incident(svc_key, timestamp) self._resolve_incident(svc_key, timestamp)
elif etype == "service_healthy":
# Positive confirmation from node-agent that a managed container
# is running. This keeps services.json populated so the supervisor
# can correctly detect drift (absent entry = never reported = unknown,
# not the same as confirmed missing).
# Also resolve any active incident — if a service that had been
# unhealthy/crashing is now confirmed healthy, the incident is over.
self.world_state["services"][svc_key]["status"] = "healthy"
self._resolve_incident(svc_key, timestamp)
elif etype in ["service_unhealthy", "healthcheck_failed"]: elif etype in ["service_unhealthy", "healthcheck_failed"]:
self.world_state["services"][svc_key]["status"] = "unhealthy" self.world_state["services"][svc_key]["status"] = "unhealthy"
self._handle_incident(svc_key, event) self._handle_incident(svc_key, event)
@ -232,6 +423,11 @@ class Observer:
"service": event.get("service"), "service": event.get("service"),
"status": "active", "status": "active",
"severity": event.get("severity"), "severity": event.get("severity"),
# trigger_type records the event type that opened this incident so that
# the supervisor can choose the appropriate remediation action
# (e.g. container_restart for containers_not_running / mqtt_unreachable
# vs. a full redeploy for other causes).
"trigger_type": event.get("type"),
"started_at": event.get("timestamp"), "started_at": event.get("timestamp"),
"last_occurrence": event.get("timestamp"), "last_occurrence": event.get("timestamp"),
"occurrence_count": 1, "occurrence_count": 1,
@ -263,34 +459,50 @@ class Observer:
self.world_state["incidents"][incident_id]["last_error"] = payload["error"] self.world_state["incidents"][incident_id]["last_error"] = payload["error"]
def run_once(self): def run_once(self):
# Find all event files # Update heartbeat
event_files = sorted(glob.glob(str(EVENTS_DIR / "*" / "*" / "*.json"))) heartbeat_file = STATE_DIR / "observer.heartbeat"
try:
heartbeat_file.touch()
except Exception as e:
logger.error(f"Failed to touch heartbeat file: {e}")
# Collect all event files grouped by node directory.
# Per-node checkpoints are compared within each directory independently,
# so late-arriving events from remote nodes (sorted earlier in the path)
# are never skipped just because another node's checkpoint is further ahead.
all_files = sorted(glob.glob(str(EVENTS_DIR / "**" / "*.json"), recursive=True))
new_files = [] new_files = []
if self.last_processed_file: for file_path in all_files:
try: try:
idx = event_files.index(self.last_processed_file) node_dir = str(Path(file_path).relative_to(EVENTS_DIR).parts[0])
new_files = event_files[idx+1:] except (IndexError, ValueError):
except ValueError: node_dir = "__unknown__"
# If last_processed_file is gone or not in list, process all last_for_node = self.node_checkpoints.get(node_dir, "")
new_files = event_files if file_path > last_for_node:
else: new_files.append((node_dir, file_path))
new_files = event_files
if not new_files: if not new_files:
# Even if no new events, prune stale entries and refresh summary freshness.
self._prune_stale_world()
self._save_world()
return return
logger.info(f"Processing {len(new_files)} new events") logger.info(f"Processing {len(new_files)} new events across "
for file_path in new_files: f"{len({n for n, _ in new_files})} node(s)")
for node_dir, file_path in new_files:
try: try:
with open(file_path, "r") as f: with open(file_path, "r") as f:
event = json.load(f) event = json.load(f)
self.process_event(event) self.process_event(event)
self.last_processed_file = file_path # Advance per-node checkpoint (only forward — no regression).
if file_path > self.node_checkpoints.get(node_dir, ""):
self.node_checkpoints[node_dir] = file_path
except Exception as e: except Exception as e:
logger.error(f"Error processing {file_path}: {e}") logger.error(f"Error processing {file_path}: {e}")
self._save_checkpoint() self._save_checkpoint()
self._prune_stale_world()
self._save_world() self._save_world()
def loop(self, interval=5): def loop(self, interval=5):


@ -0,0 +1,55 @@
### Agent System
Central runtime materializer and Operator Control Plane UI.
#### Components
- **Redis**: Central state store (on PIHA).
- **Runtime Materializer**: Converts Redis state to JSON files in `/opt/homelab/world`.
- **Web UI**: Exposes API endpoints and serves the Operator UI.
- **Telegram Bot**: Provides operator commands and action approvals via Telegram.
#### Configuration
Environment variables should be set in `.env` (see `env.example`).
Key variables for the Telegram Bot:
- `TELEGRAM_BOT_TOKEN`: Your bot token from @BotFather.
- `TELEGRAM_ALLOWED_USER_IDS`: Comma-separated list of authorized Telegram User IDs.
- `CONTROL_PLANE_URL`: URL to the `agent-system-webui` (default: `http://webui:8080`).
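
As a rough illustration of how the comma-separated `TELEGRAM_ALLOWED_USER_IDS` value might be interpreted, the sketch below turns it into a set of integer IDs. The helper name `parse_allowed_ids` is hypothetical and not part of the bot; it assumes blank entries should be skipped and non-numeric entries rejected.

```python
def parse_allowed_ids(raw: str) -> set[int]:
    """Parse a comma-separated list of Telegram user IDs (hypothetical helper)."""
    ids = set()
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas and an empty variable
        if not part.isdigit():
            raise ValueError(f"not a numeric Telegram user ID: {part!r}")
        ids.add(int(part))
    return ids


if __name__ == "__main__":
    print(parse_allowed_ids("12345678, 87654321,"))
```

Rejecting malformed entries at startup, rather than silently dropping them, makes a typo in `.env` visible before the bot starts ignoring an operator.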
#### Telegram Commands
- `/status`: Check bot and API connectivity.
- `/summary`: System health overview.
- `/nodes`: List homelab nodes and their status.
- `/services`: Summary of services across nodes.
- `/unhealthy`: List all unhealthy components.
- `/incidents`: View active incidents.
- `/actions`: Summary of operator actions.
- `/help`: List all commands.
#### Deployment (on PIHA)
```bash
cd services/agent-system
./deploy.sh
```
#### Deployment (on CHELSTY)
```bash
cd services/stability-agent
docker compose up -d --build
```
#### Verification
The `deploy.sh` script automatically verifies the local endpoints.
You can also manually check:
```bash
# Check runtime summary
curl http://localhost:18180/summary
# Check discovered nodes
curl http://localhost:18180/nodes
# Check discovered services
curl http://localhost:18180/services
```
#### Directory Structure
- `/opt/homelab/world`: Contains materialized JSON state.
- `/opt/homelab/state`: Contains operator configuration and local heartbeats.


@ -0,0 +1,52 @@
### Action Approval Data Model
Actions are JSON files stored in `/opt/homelab/actions/{status}/{action_id}.json`.
#### Statuses
- `pending`: Waiting for operator approval. AI agents create actions in this state.
- `approved`: Approved by operator, ready for execution.
- `rejected`: Rejected by operator, will not be executed.
- `running`: Currently being executed by an agent (e.g. `materializer`).
- `completed`: Successfully executed.
- `failed`: Execution failed.
#### Human-in-the-Loop (HIL) Protocol
1. **Request**: Agent identifies a required change and writes a JSON file to `actions/pending/`.
2. **Notification**: System notifies the human operator.
3. **Audit**: Human reviews `details.reason` and `details.diff`.
4. **Authorization**: Human moves the file to `approved/`.
5. **Execution**: Agent monitors `approved/` and executes the task.
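
The Request step can be sketched as follows. This is a minimal illustration rather than the code any agent actually runs: the function name `create_pending_action`, the `{service}-{timestamp}` ID scheme, the fixed `guarded` risk and the `.tmp` suffix are all assumptions made for the example. Only the directory layout and field names come from this document.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def create_pending_action(actions_root, service, node, action_type, reason):
    """Write a new action into actions/pending/ and return its path.

    Hypothetical helper: real agents may build their actions differently.
    """
    now = int(time.time())
    action_id = f"{service}-{now}"  # assumed ID scheme, for illustration only
    action = {
        "action_id": action_id,
        "service": service,
        "node": node,
        "type": action_type,
        "risk": "guarded",  # assumed default for the example
        "status": "pending",
        "created_at": now,
        "updated_at": now,
        "details": {"reason": reason},
        "transition_history": [
            {"from": None, "to": "pending", "timestamp": now, "by": "system"}
        ],
    }
    pending_dir = Path(actions_root) / "pending"
    pending_dir.mkdir(parents=True, exist_ok=True)
    target = pending_dir / f"{action_id}.json"
    # Write to a temp file first, then rename, so a watcher that scans for
    # *.json never reads a half-written action.
    fd, tmp_name = tempfile.mkstemp(dir=pending_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(action, f, indent=2)
    os.replace(tmp_name, target)
    return target
```

Calling `create_pending_action("/opt/homelab/actions", "frigate", "chelsty", "deploy_service", "Security update")` would leave a single `.json` file in `pending/` for the operator to review.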
#### Schema
```json
{
"action_id": "string",
"service": "string",
"node": "string",
"type": "deploy_service | restart_service | rollback | scale",
"risk": "nominal | guarded | critical",
"status": "pending | approved | rejected | ...",
"created_at": <unix_seconds>,
"updated_at": <unix_seconds>,
"details": {
"image": "string",
"reason": "string",
"diff": "string"
},
"transition_history": [
{
"from": "string | null",
"to": "string",
"timestamp": <unix_seconds>,
"by": "string (system | operator-tg-12345 | webui)"
}
]
}
```
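
A consumer could sanity-check an action against this schema before acting on it. The sketch below is an assumption-laden example: `validate_action` is a hypothetical helper, and the schema does not state which fields are mandatory, so the example simply treats every top-level field except `details` as required.

```python
ALLOWED_TYPES = ("deploy_service", "restart_service", "rollback", "scale")
ALLOWED_RISKS = ("nominal", "guarded", "critical")
ALLOWED_STATUSES = (
    "pending", "approved", "rejected", "running", "completed", "failed",
)
# Assumption: every top-level field except "details" is required.
REQUIRED_FIELDS = (
    "action_id", "service", "node", "type", "risk", "status",
    "created_at", "updated_at", "transition_history",
)


def validate_action(action: dict) -> list[str]:
    """Return a list of problems found in an action dict (empty if none)."""
    errors = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in action
    ]
    checks = (
        ("type", ALLOWED_TYPES),
        ("risk", ALLOWED_RISKS),
        ("status", ALLOWED_STATUSES),
    )
    for name, allowed in checks:
        if name in action and action[name] not in allowed:
            errors.append(f"invalid {name}: {action[name]!r}")
    for name in ("created_at", "updated_at"):
        if name in action and not isinstance(action[name], (int, float)):
            errors.append(f"{name} must be a Unix timestamp in seconds")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a notifier show the operator everything that is wrong with a malformed action in one message.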
#### Workflow
1. A system component (e.g. `runtime-materializer` or a future analyzer) creates a file in `actions/pending/`.
2. `telegram-bot` detects the file, sends a message to allowed users.
3. Operator clicks "Approve" or "Reject".
4. `telegram-bot` moves the file to `actions/approved/` or `actions/rejected/` atomically, appending a transition to `transition_history`.
5. The responsible agent (e.g. `stability-agent` on the target node) picks up the `approved` action, moves it to `running`, executes it, and finally moves it to `completed` or `failed`.
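
One possible way to implement the move in steps 4 and 5 is sketched below. It is not the bot's actual code: `transition_action` and the `.tmp` suffix are hypothetical, and the example assumes all status directories live on the same filesystem so that `os.replace` can rename atomically.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def transition_action(actions_root, action_id, from_status, to_status, by):
    """Move an action between status directories, recording the transition."""
    root = Path(actions_root)
    source = root / from_status / f"{action_id}.json"
    target_dir = root / to_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{action_id}.json"

    action = json.loads(source.read_text())
    now = int(time.time())
    action["status"] = to_status
    action["updated_at"] = now
    action.setdefault("transition_history", []).append(
        {"from": from_status, "to": to_status, "timestamp": now, "by": by}
    )

    # Write the updated JSON to a temp file in the target directory, then
    # rename it into place so readers never see a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(action, f, indent=2)
    os.replace(tmp_name, target)
    # Remove the old copy last; a crash before this line leaves a duplicate
    # in the source directory rather than losing the action.
    source.unlink()
    return target
```

With this ordering the action briefly exists in both directories, so a consumer should treat the copy in the later status directory as authoritative.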

services/agent-system/deploy.sh Executable file

@ -0,0 +1,28 @@
#!/bin/bash
set -e
echo ">>> Validating docker-compose configuration..."
docker compose config
echo ">>> Building and starting Agent System services..."
docker compose up -d --build
echo ">>> Services status:"
docker ps --filter "name=agent-system" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
if [ -z "$TELEGRAM_BOT_TOKEN" ]; then
echo ">>> Telegram bot status: DISABLED (token missing)"
else
echo ">>> Telegram bot status: ENABLED"
fi
echo ">>> Verifying API endpoints..."
sleep 5 # Give it a moment to start
endpoints=("summary" "nodes" "services")
for ep in "${endpoints[@]}"; do
echo "Checking /$ep..."
curl -s -f http://localhost:18180/$ep > /dev/null && echo " OK" || echo " FAILED"
done
echo ">>> Deployment complete."


@ -0,0 +1,47 @@
services:
redis:
image: redis:7
container_name: agent-system-redis
ports:
- "6379:6379"
restart: unless-stopped
webui:
build: ./webui
container_name: agent-system-webui
ports:
- "18180:8080"
volumes:
- /opt/homelab:/opt/homelab
depends_on:
- redis
restart: unless-stopped
runtime-materializer:
build: ./runtime-materializer
container_name: agent-system-runtime-materializer
environment:
REDIS_HOST: redis
REDIS_PORT: "6379"
HOMELAB_WORLD_ROOT: /opt/homelab/world
WORLD_DIR: /opt/homelab/world
MATERIALIZE_INTERVAL: "10"
volumes:
- /opt/homelab:/opt/homelab
depends_on:
- redis
restart: unless-stopped
telegram-bot:
build: ./telegram-bot
container_name: agent-system-telegram-bot
environment:
TELEGRAM_BOT_TOKEN: ${TELEGRAM_BOT_TOKEN}
TELEGRAM_ALLOWED_USER_IDS: ${TELEGRAM_ALLOWED_USER_IDS}
CONTROL_PLANE_URL: ${CONTROL_PLANE_URL:-http://webui:8080}
ENABLE_LLM_FALLBACK: ${ENABLE_LLM_FALLBACK:-false}
OPENCLAW_BASE_URL: ${OPENCLAW_BASE_URL}
ACTIONS_ROOT: /opt/homelab/actions
volumes:
- /opt/homelab:/opt/homelab
restart: on-failure


@ -0,0 +1,19 @@
# Telegram Bot Configuration
# Get token from @BotFather
TELEGRAM_BOT_TOKEN=123456789:ABCdefGHIjklMNOpqrsTUVwxyz
# Comma-separated list of Telegram User IDs
TELEGRAM_ALLOWED_USER_IDS=12345678,87654321
# Local control-plane API (default is internal compose address)
CONTROL_PLANE_URL=http://webui:8080
# Optional LLM fallback logic
ENABLE_LLM_FALLBACK=false
OPENCLAW_BASE_URL=http://openclaw.internal
# Runtime Materializer Configuration
REDIS_HOST=100.108.208.3
REDIS_PORT=6379
# Paths
HOMELAB_ROOT=/opt/homelab
ACTIONS_ROOT=/opt/homelab/actions
WORLD_DIR=/opt/homelab/world


@ -0,0 +1,16 @@
FROM python:3.11-slim
WORKDIR /app
# Install the redis Python package
RUN pip install --no-cache-dir redis
COPY materializer.py .
# Ensure the world directory exists in the container (though it will likely be a volume)
RUN mkdir -p /opt/homelab/world
# Use unbuffered output to see logs in docker
ENV PYTHONUNBUFFERED=1
CMD ["python", "materializer.py"]


@ -0,0 +1,251 @@
import redis
import json
import os
import time
import argparse
import urllib.request
import urllib.error
from datetime import datetime
# Configuration from environment variables
REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
REDIS_PORT = int(os.environ.get("REDIS_PORT", 6379))
WORLD_DIR = os.environ.get("WORLD_DIR", "/opt/homelab/world")
# When set, materialize from the control-plane HTTP API instead of Redis.
# This is the authoritative source of truth: the observer writes clean world
# state to the control-plane API, which the materializer mirrors locally so
# the webui's /snapshot (and all other endpoints) reflect the same data.
#
# Example: CONTROL_PLANE_URL=http://100.95.58.48:18180
CONTROL_PLANE_URL = os.environ.get("CONTROL_PLANE_URL", "").rstrip("/")
def get_redis_client():
"""Returns a Redis client with decoding enabled."""
return redis.Redis(
host=REDIS_HOST,
port=REDIS_PORT,
decode_responses=True,
socket_timeout=5
)
def safe_json_loads(data, default=None):
"""Safely loads JSON from a string."""
if not data:
return default
try:
if isinstance(data, (dict, list)):
return data
return json.loads(data)
except (json.JSONDecodeError, TypeError):
return data
def normalize_health(health):
"""Normalizes health values for the UI."""
if not health:
return "nominal"
h = str(health).lower()
if h in ["healthy", "ok", "running", "nominal"]:
return "nominal"
if h in ["degraded", "warning"]:
return "degraded"
return "error"
def _fetch_json(url):
"""Fetch JSON from a URL, returning parsed data or None on error."""
try:
with urllib.request.urlopen(url, timeout=10) as resp:
return json.loads(resp.read())
except Exception as e:
print(f"[{datetime.now().isoformat()}] Error fetching {url}: {e}")
return None
def write_json(filename, data):
path = os.path.join(WORLD_DIR, filename)
with open(path, "w") as f:
json.dump(data, f, indent=2)
def materialize_from_api():
"""Mirror world state from the control-plane API to local world files.
The control-plane observer on VPS is the single authoritative writer of
world state. By fetching from its HTTP API we get the same clean, pruned
data that the /summary endpoint serves, with no stale Redis artefacts.
Returns True if all fetches succeeded and files were written, False otherwise.
"""
print(f"[{datetime.now().isoformat()}] Materializing from control-plane API: {CONTROL_PLANE_URL}")
endpoints = {
"nodes.json": f"{CONTROL_PLANE_URL}/nodes",
"services.json": f"{CONTROL_PLANE_URL}/services",
"incidents.json": f"{CONTROL_PLANE_URL}/incidents",
"deployments.json": f"{CONTROL_PLANE_URL}/deployments",
"recommendations.json":f"{CONTROL_PLANE_URL}/recommendations",
"runtime-summary.json":f"{CONTROL_PLANE_URL}/summary",
"events.json": f"{CONTROL_PLANE_URL}/events",
}
fetched = {}
for filename, url in endpoints.items():
data = _fetch_json(url)
if data is None:
print(f"[{datetime.now().isoformat()}] Aborting: failed to fetch {url}")
return False
fetched[filename] = data
os.makedirs(WORLD_DIR, exist_ok=True)
for filename, data in fetched.items():
write_json(filename, data)
svc_count = len(fetched.get("services.json") or [])
print(f"[{datetime.now().isoformat()}] Materialized from API: {svc_count} services → {WORLD_DIR}")
return True
def materialize():
"""Reads state from Redis and writes JSON files to the world directory."""
print(f"[{datetime.now().isoformat()}] Materializing world state...")
try:
r = get_redis_client()
# 1. Nodes
nodes = []
node_keys = r.keys("homelab:nodes:*")
for key in node_keys:
node_data = r.hgetall(key)
if node_data:
# Normalize health
if "health" in node_data:
node_data["health"] = normalize_health(node_data["health"])
# Parse JSON fields if they exist
if "capabilities" in node_data:
node_data["capabilities"] = safe_json_loads(node_data["capabilities"], [])
if "checks" in node_data:
node_data["checks"] = safe_json_loads(node_data["checks"], {})
nodes.append(node_data)
# 2. Services
services = []
service_keys = r.keys("homelab:services:*")
for key in service_keys:
svc_data = r.hgetall(key)
if svc_data:
# Normalize health
if "health" in svc_data:
svc_data["health"] = normalize_health(svc_data["health"])
if "dependencies" in svc_data:
svc_data["dependencies"] = safe_json_loads(svc_data["dependencies"], [])
if "recommendations" in svc_data:
svc_data["recommendations"] = safe_json_loads(svc_data["recommendations"], [])
services.append(svc_data)
# 3. Events (Stream)
events = []
try:
# Get last 100 events from the stream
raw_events = r.xrevrange("homelab:events", count=100)
for event_id, data in raw_events:
event = data.copy()
event["id"] = event_id
if "details" in event:
event["details"] = safe_json_loads(event["details"], {})
events.append(event)
except redis.exceptions.ResponseError:
# homelab:events might not be a stream or doesn't exist
pass
# 4. Incidents (Hash)
incidents = []
incident_keys = r.keys("homelab:incidents:*")
for key in incident_keys:
incident_data = r.hgetall(key)
if incident_data:
# Normalize health if present
if "health" in incident_data:
incident_data["health"] = normalize_health(incident_data["health"])
incidents.append(incident_data)
# 5. Deployments (Hash)
deployments = []
deployment_keys = r.keys("homelab:deployments:*")
for key in deployment_keys:
dep_data = r.hgetall(key)
if dep_data:
deployments.append(dep_data)
# 6. Recommendations (Hash)
recommendations = []
recommendation_keys = r.keys("homelab:recommendations:*")
for key in recommendation_keys:
rec_data = r.hgetall(key)
if rec_data:
recommendations.append(rec_data)
# 7. Runtime Summary
unhealthy_services = [s for s in services if s.get("health") != "nominal"]
active_incidents = [i for i in incidents if i.get("status") not in ["resolved", "closed"]]
status = "nominal"
if len(active_incidents) > 0 or len(unhealthy_services) > 5:
status = "error"
elif len(unhealthy_services) > 0:
status = "degraded"
summary = {
"status": status,
"timestamp": datetime.utcnow().isoformat() + "Z",
"last_update": int(time.time()),
"node_count": len(nodes),
"service_count": len(services),
"active_incidents_count": len(active_incidents),
"unhealthy_services_count": len(unhealthy_services),
"incident_count": len(incidents),
"recent_events_count": len(events),
"stale": False
}
# Ensure directory exists
os.makedirs(WORLD_DIR, exist_ok=True)
write_json("runtime-summary.json", summary)
write_json("nodes.json", nodes)
write_json("services.json", services)
write_json("incidents.json", incidents)
write_json("events.json", events)
write_json("deployments.json", deployments)
write_json("recommendations.json", recommendations)
print(f"[{datetime.now().isoformat()}] Successfully materialized to {WORLD_DIR}")
except redis.exceptions.ConnectionError as e:
print(f"Redis connection error: {e}")
except Exception as e:
print(f"Unexpected error during materialization: {e}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Homelab Runtime Materializer")
parser.add_argument("--once", action="store_true", help="Run once and exit")
parser.add_argument("--interval", type=int, default=30, help="Sleep interval between runs (seconds)")
args = parser.parse_args()
if CONTROL_PLANE_URL:
print(f"Mode: control-plane API ({CONTROL_PLANE_URL})")
run_fn = materialize_from_api
else:
print(f"Mode: Redis ({REDIS_HOST}:{REDIS_PORT})")
run_fn = materialize
interval = int(os.environ.get("MATERIALIZE_INTERVAL", args.interval))
if args.once:
run_fn()
else:
print(f"Starting materializer loop (interval: {interval}s)...")
while True:
run_fn()
time.sleep(interval)


@ -0,0 +1,39 @@
#!/bin/bash
# Script to create a test pending action for Telegram bot verification.
ACTIONS_PENDING_DIR=${ACTIONS_ROOT:-/opt/homelab/actions}/pending
mkdir -p "$ACTIONS_PENDING_DIR"
ACTION_ID="test-$(date +%s)"
FILE_PATH="$ACTIONS_PENDING_DIR/$ACTION_ID.json"
TIMESTAMP=$(date +%s)
cat <<EOF > "$FILE_PATH"
{
"action_id": "$ACTION_ID",
"service": "frigate",
"node": "chelsty",
"type": "deploy_service",
"risk": "guarded",
"status": "pending",
"created_at": $TIMESTAMP,
"updated_at": $TIMESTAMP,
"details": {
"image": "blakeblackshear/frigate:0.13.0",
"reason": "Security update for Frigate",
"diff": "image: blakeblackshear/frigate:0.12.0 -> 0.13.0"
},
"transition_history": [
{
"from": null,
"to": "pending",
"timestamp": $TIMESTAMP,
"by": "system-test"
}
]
}
EOF
echo "Test action created: $FILE_PATH"
echo "If the telegram-bot is running and configured, you should receive a notification."


@ -0,0 +1,10 @@
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY bot.py .
CMD ["python", "bot.py"]


@ -0,0 +1,454 @@
import os
import json
import time
import asyncio
import logging
import urllib.request
import urllib.error
from pathlib import Path
from telegram import Update, InlineKeyboardButton, InlineKeyboardMarkup
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler, CallbackQueryHandler, MessageHandler, filters
# Setup logging
logging.basicConfig(
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
level=logging.INFO
)
logger = logging.getLogger(__name__)
# Configuration
TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
ALLOWED_IDS = [int(i.strip()) for i in os.getenv("TELEGRAM_ALLOWED_USER_IDS", "").split(",") if i.strip()]
ACTIONS_ROOT = Path(os.getenv("ACTIONS_ROOT", "/opt/homelab/actions"))
CONTROL_PLANE_URL = os.getenv("CONTROL_PLANE_URL", "http://webui:8080")
ENABLE_LLM_FALLBACK = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
OPENCLAW_BASE_URL = os.getenv("OPENCLAW_BASE_URL")
async def fetch_api(path):
"""Helper to fetch JSON from the Control Plane API."""
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
try:
def do_request():
req = urllib.request.Request(url)
with urllib.request.urlopen(req, timeout=5) as response:
if response.status != 200:
return None
return json.loads(response.read().decode())
return await asyncio.to_thread(do_request)
except Exception as e:
logger.error(f"Error fetching {url}: {e}")
return None
async def post_api(path, data):
"""Helper to POST JSON to the Control Plane API."""
url = f"{CONTROL_PLANE_URL.rstrip('/')}/{path.lstrip('/')}"
try:
body = json.dumps(data).encode("utf-8")
def do_request():
req = urllib.request.Request(url, data=body, method="POST")
req.add_header("Content-Type", "application/json")
with urllib.request.urlopen(req, timeout=5) as response:
return response.status == 200
return await asyncio.to_thread(do_request)
except Exception as e:
logger.error(f"Error posting to {url}: {e}")
return False
def _format_pending_action(action_id: str, data: dict) -> str:
"""Build the Telegram Markdown message for a pending action notification.
Extracted so it can be unit-tested without a live Telegram connection.
"""
# Supervisor writes risk_level; action-model.md legacy schema used risk.
risk = data.get("risk_level") or data.get("risk", "unknown")
message = (
f"⚠️ *Pending Action*\n"
f"ID: `{action_id}`\n"
f"Type: `{data.get('type', 'unknown')}`\n"
f"Service: `{data.get('service', 'unknown')}`\n"
f"Node: `{data.get('node', 'unknown')}`\n"
f"Risk: *{risk}*\n"
)
# description carries the human-readable substance of the action (required for
# alert_only actions where it is the entire operator-visible message).
description = data.get("description", "")
if description:
truncated = description[:300] + ("..." if len(description) > 300 else "")
message += f"Description: `{truncated}`\n"
# Legacy details block (old action-model.md schema) — kept for backwards compat.
if "details" in data:
details_str = json.dumps(data["details"], indent=2)
if len(details_str) > 1000:
details_str = details_str[:1000] + "..."
message += f"\nDetails:\n```json\n{details_str}\n```"
return message
class ApprovalBot:
def __init__(self):
self.pending_dir = ACTIONS_ROOT / "pending"
self.approved_dir = ACTIONS_ROOT / "approved"
self.rejected_dir = ACTIONS_ROOT / "rejected"
# Track which action IDs we have already notified in this session to avoid spam
self.notified_actions = set()
async def check_pending_actions(self, context: ContextTypes.DEFAULT_TYPE):
"""Job that periodically checks for new pending action files."""
if not self.pending_dir.exists():
return
try:
for action_file in self.pending_dir.glob("*.json"):
action_id = action_file.stem
if action_id in self.notified_actions:
continue
try:
data = json.loads(action_file.read_text())
# Only notify if it's truly pending
if data.get("status") == "pending":
await self.notify_users(context, action_id, data)
self.notified_actions.add(action_id)
except Exception as e:
logger.error(f"Error processing action file {action_file}: {e}")
except Exception as e:
logger.error(f"Error scanning pending directory: {e}")
async def notify_users(self, context: ContextTypes.DEFAULT_TYPE, action_id: str, data: dict):
"""Sends an approval request message to all allowed users."""
message = _format_pending_action(action_id, data)
keyboard = [
[
InlineKeyboardButton("✅ Approve", callback_data=f"approve:{action_id}"),
InlineKeyboardButton("❌ Reject", callback_data=f"reject:{action_id}"),
]
]
reply_markup = InlineKeyboardMarkup(keyboard)
for user_id in ALLOWED_IDS:
try:
await context.bot.send_message(
chat_id=user_id,
text=message,
parse_mode="Markdown",
reply_markup=reply_markup
)
logger.info(f"Notified user {user_id} about action {action_id}")
except Exception as e:
logger.error(f"Failed to notify user {user_id}: {e}")
async def handle_callback(self, update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Handles button clicks for Approve/Reject."""
query = update.callback_query
user_id = query.from_user.id
if user_id not in ALLOWED_IDS:
await query.answer("Unauthorized", show_alert=True)
return
await query.answer()
cb_data = query.data
if ":" not in cb_data:
return
action, action_id = cb_data.split(":", 1)
target_status = "approved" if action == "approve" else "rejected"
# Use API for mutation if available, fallback to local disk move
success = await post_api("/action/mutate", {"id": action_id, "status": target_status})
msg = "Success" if success else "API call failed"
if not success:
# Fallback to direct disk manipulation (original behavior)
success, msg = self.move_action(action_id, target_status, user_id, query.from_user.username or str(user_id))
if success:
status_text = "✅ Approved" if target_status == "approved" else "❌ Rejected"
await query.edit_message_text(
text=query.message.text + f"\n\n{status_text} by {query.from_user.first_name}",
parse_mode="Markdown"
)
# Remove from notified list as it's no longer pending
if action_id in self.notified_actions:
self.notified_actions.remove(action_id)
else:
await query.message.reply_text(f"Failed to process action {action_id}: {msg}")
def move_action(self, action_id, target_status, user_id, username):
"""Moves action file and updates its status and history."""
source_path = self.pending_dir / f"{action_id}.json"
if not source_path.exists():
return False, "Action file no longer exists in pending."
target_dir = self.approved_dir if target_status == "approved" else self.rejected_dir
target_dir.mkdir(parents=True, exist_ok=True)
target_path = target_dir / f"{action_id}.json"
try:
data = json.loads(source_path.read_text())
current_status = data.get("status", "pending")
# Update data
data["status"] = target_status
data["updated_at"] = time.time()
history = data.get("transition_history", [])
history.append({
"from": current_status,
"to": target_status,
"timestamp": time.time(),
"by": f"tg:{username}"
})
data["transition_history"] = history
# Move: write to new location, then delete old (two steps, so not strictly atomic)
target_path.write_text(json.dumps(data, indent=2))
source_path.unlink()
logger.info(f"Action {action_id} moved from {current_status} to {target_status} by {username}")
return True, "Success"
except Exception as e:
logger.error(f"Error moving action file: {e}")
return False, str(e)
async def start_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Simple start command to help users find their ID."""
user = update.effective_user
message = (
f"Hello {user.first_name}! 🤖\n"
f"Your Telegram User ID is: `{user.id}`\n\n"
)
if user.id in ALLOWED_IDS:
message += "✅ You are authorized to manage the homelab.\n\n"
message += "Use /help to see available commands."
else:
message += "❌ You are NOT authorized. Add your ID to `TELEGRAM_ALLOWED_USER_IDS`."
await update.message.reply_text(message, parse_mode="Markdown")
async def status_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
res = await fetch_api("/summary")
status = "✅ Online" if res else "❌ Unreachable"
message = (
f"🤖 *Telegram Bot Status*\n"
f"Control Plane API: {status}\n"
f"Target URL: `{CONTROL_PLANE_URL}`\n"
)
await update.message.reply_text(message, parse_mode="Markdown")
async def summary_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
data = await fetch_api("/summary")
if not data:
await update.message.reply_text("❌ Failed to fetch summary from Control Plane.")
return
msg = "📊 *System Summary*\n"
msg += f"Status: `{data.get('status', 'unknown')}`\n"
msg += f"Nodes: {data.get('node_count', 0)}\n"
msg += f"Services: {data.get('service_count', 0)}\n"
msg += f"Active Incidents: {data.get('active_incidents_count', 0)}\n"
if data.get('stale'):
msg += "\n⚠️ *Warning: Data is stale!*"
await update.message.reply_text(msg, parse_mode="Markdown")
async def nodes_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
nodes = await fetch_api("/nodes")
if nodes is None:
await update.message.reply_text("❌ Failed to fetch nodes.")
return
if not nodes:
await update.message.reply_text("No nodes discovered in the fleet.")
return
msg = "🖥️ *Nodes Status*\n"
for node in nodes:
health_icon = "" if node.get('health') == 'nominal' else "⚠️" if node.get('health') == 'degraded' else ""
msg += f"{health_icon} *{node.get('hostname')}*: `{node.get('status', 'unknown')}`\n"
msg += f" Last seen: {node.get('last_seen', 'N/A')}\n"
await update.message.reply_text(msg, parse_mode="Markdown")
async def services_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
services = await fetch_api("/services")
if services is None:
await update.message.reply_text("❌ Failed to fetch services.")
return
# Summarize by node
nodes = {}
for s in services:
node = s.get("node", "unknown")
if node not in nodes: nodes[node] = []
nodes[node].append(s)
msg = "⚙️ *Services Summary*\n"
if not nodes:
msg += "No services discovered."
else:
for node, svc_list in sorted(nodes.items()):
nominal = len([s for s in svc_list if s.get("health") == "nominal"])
msg += f"• *{node}*: {nominal}/{len(svc_list)} nominal\n"
msg += "\nUse /unhealthy to see issues."
await update.message.reply_text(msg, parse_mode="Markdown")
async def unhealthy_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
services = await fetch_api("/services")
nodes = await fetch_api("/nodes")
msg = "⚠️ *Unhealthy Components*\n"
found = False
if services:
for s in services:
health = s.get("health", "").lower()
if health != "nominal":
msg += f"• Service *{s.get('name')}* on *{s.get('node')}*: `{health}`\n"
found = True
if nodes:
for n in nodes:
checks = n.get("checks", {})
if isinstance(checks, str):
try: checks = json.loads(checks)
except ValueError: checks = {}
docker = checks.get("docker", {})
if docker.get("status") == "ok":
for c in docker.get("containers", []):
if c.get("state") != "running":
msg += f"• Container *{c.get('name')}* on *{n.get('hostname')}*: `{c.get('state')}`\n"
found = True
if not found:
msg += "All systems nominal. ✅"
await update.message.reply_text(msg, parse_mode="Markdown")
async def incidents_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
incidents = await fetch_api("/incidents")
if incidents is None:
await update.message.reply_text("❌ Failed to fetch incidents.")
return
active = [i for i in incidents if i.get("status") not in ("resolved", "closed")]
if not active:
await update.message.reply_text("No active incidents. ✅")
return
msg = "🚨 *Active Incidents*\n"
for inc in active:
severity = inc.get('severity', 'info').upper()
msg += f"• [{severity}] *{inc.get('type')}*: {inc.get('message')}\n"
await update.message.reply_text(msg, parse_mode="Markdown")
async def actions_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
if update.effective_user.id not in ALLOWED_IDS: return
actions = await fetch_api("/actions")
if actions is None:
await update.message.reply_text("❌ Actions endpoint unavailable.")
return
msg = "⚡ *Actions Summary*\n"
total = 0
for status, act_list in actions.items():
if act_list:
msg += f"{status.capitalize()}: {len(act_list)}\n"
total += len(act_list)
if total == 0:
msg = "No actions recorded."
await update.message.reply_text(msg, parse_mode="Markdown")
async def help_command(update: Update, context: ContextTypes.DEFAULT_TYPE):
msg = (
"📖 *Supported Commands*\n\n"
"/status - Check bot and API connectivity\n"
"/summary - System health overview\n"
"/nodes - List homelab nodes and their status\n"
"/services - Summary of services across nodes\n"
"/unhealthy - List all unhealthy components\n"
"/incidents - View active incidents\n"
"/actions - Summary of operator actions\n"
"/help - Show this help message\n\n"
"Free text will be handled by the guidance system."
)
await update.message.reply_text(msg, parse_mode="Markdown")
async def handle_fallback(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Handles non-command messages."""
if update.effective_user.id not in ALLOWED_IDS: return
if ENABLE_LLM_FALLBACK and OPENCLAW_BASE_URL:
# Placeholder for OpenClaw LLM fallback
# In a real scenario, this would call the LLM API
logger.info(f"LLM fallback requested for: {update.message.text}")
await update.message.reply_text(
"Use /summary, /nodes, /services, /unhealthy, /incidents, /actions."
)
async def run_bot():
    if not TOKEN:
        # Requirement: a missing Telegram token must not fail the stack; exit cleanly instead.
        logger.critical("TELEGRAM_BOT_TOKEN is not set. Telegram bot will not start.")
        return
    bot_logic = ApprovalBot()
    application = ApplicationBuilder().token(TOKEN).build()
    application.add_handler(CommandHandler("start", start_command))
    application.add_handler(CommandHandler("status", status_command))
    application.add_handler(CommandHandler("summary", summary_command))
    application.add_handler(CommandHandler("nodes", nodes_command))
    application.add_handler(CommandHandler("services", services_command))
    application.add_handler(CommandHandler("unhealthy", unhealthy_command))
    application.add_handler(CommandHandler("incidents", incidents_command))
    application.add_handler(CommandHandler("actions", actions_command))
    application.add_handler(CommandHandler("help", help_command))
    application.add_handler(MessageHandler(filters.TEXT & (~filters.COMMAND), handle_fallback))
    application.add_handler(CallbackQueryHandler(bot_logic.handle_callback))
    # Schedule the pending actions check
    job_queue = application.job_queue
    if job_queue:
        job_queue.run_repeating(bot_logic.check_pending_actions, interval=10, first=5)
    else:
        logger.warning("JobQueue is not available. Periodic pending actions check will be skipped.")
    logger.info("Starting Telegram Approval Bot...")
    await application.initialize()
    await application.start()
    await application.updater.start_polling()
    # Run until the application is stopped
    stop_event = asyncio.Event()
    try:
        await stop_event.wait()
    except asyncio.CancelledError:
        # asyncio.run() cancels this task on Ctrl-C, so the interrupt arrives as CancelledError.
        logger.info("Stopping bot...")
        raise
    finally:
        # The updater must be stopped before the application can be stopped and shut down.
        await application.updater.stop()
        await application.stop()
        await application.shutdown()
if __name__ == "__main__":
try:
asyncio.run(run_bot())
except KeyboardInterrupt:
pass
except Exception as e:
logger.error(f"Fatal error: {e}")

@ -0,0 +1 @@
python-telegram-bot[job-queue]==20.7

@ -0,0 +1,38 @@
"""Stub telegram before bot.py is imported so pytest doesn't need the real package."""
from __future__ import annotations
import sys
import types
from unittest.mock import MagicMock
def _make_telegram_stub() -> types.ModuleType:
mod = types.ModuleType("telegram")
mod.Update = MagicMock
mod.InlineKeyboardButton = MagicMock
mod.InlineKeyboardMarkup = MagicMock
return mod
def _make_telegram_ext_stub() -> types.ModuleType:
mod = types.ModuleType("telegram.ext")
mod.ApplicationBuilder = MagicMock
# ContextTypes.DEFAULT_TYPE is referenced as a type annotation at class-body
# evaluation time, so it must be a real attribute, not a dynamic MagicMock attr.
ContextTypesMock = MagicMock()
ContextTypesMock.DEFAULT_TYPE = type(None)
mod.ContextTypes = ContextTypesMock
mod.CommandHandler = MagicMock
mod.CallbackQueryHandler = MagicMock
mod.MessageHandler = MagicMock
mod.filters = MagicMock()
return mod
# Insert before any import of bot.py
if "telegram" not in sys.modules:
sys.modules["telegram"] = _make_telegram_stub()
if "telegram.ext" not in sys.modules:
sys.modules["telegram.ext"] = _make_telegram_ext_stub()
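
Review note: the stubbing approach above can be tried in isolation. The sketch below registers a placeholder module in `sys.modules` so that a later import resolves to it without the real package being installed. The module name `fake_sdk` and the helper `install_stub` are hypothetical, chosen only for illustration.

```python
import importlib
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a stub module under `name` unless a module is already loaded."""
    if name in sys.modules:
        return sys.modules[name]
    mod = types.ModuleType(name)
    for attr_name, value in attrs.items():
        setattr(mod, attr_name, value)
    sys.modules[name] = mod
    return mod


# Must run before the code under test imports the package.
install_stub("fake_sdk", Client=MagicMock)

# The import system checks sys.modules first, so this returns the stub.
fake_sdk = importlib.import_module("fake_sdk")
client = fake_sdk.Client()
```

Because pytest loads `conftest.py` before collecting the test modules in the same directory, placing the stubs there ensures they are registered before `bot.py` is imported.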

@ -0,0 +1,116 @@
"""Tests for _format_pending_action — no Telegram connection required.
telegram stubs are set up in conftest.py before this module is imported.
"""
from __future__ import annotations
import sys
from pathlib import Path
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent))
from bot import _format_pending_action
# ---------------------------------------------------------------------------
# Bug 1 — risk_level field
# ---------------------------------------------------------------------------
def test_risk_level_shown_when_present():
data = {
"type": "container_restart", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "low",
}
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
assert "Risk: *low*" in msg
assert "unknown" not in msg
def test_risk_falls_back_to_legacy_risk_key():
data = {
"type": "redeploy", "service": "mosquitto",
"node": "chelsty-infra", "risk": "guarded",
}
msg = _format_pending_action("redeploy-chelsty-infra-mosquitto", data)
assert "Risk: *guarded*" in msg
def test_risk_unknown_when_both_absent():
data = {"type": "redeploy", "service": "foo", "node": "bar"}
msg = _format_pending_action("redeploy-bar-foo", data)
assert "Risk: *unknown*" in msg
# ---------------------------------------------------------------------------
# Bug 2 — description field
# ---------------------------------------------------------------------------
def test_description_shown_for_alert_only():
data = {
"type": "alert_only", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "info",
"description": "3 entities unavailable for >1h",
}
msg = _format_pending_action("alert-ha-entity-unavailable-chelsty-ha", data)
assert "3 entities unavailable for >1h" in msg
assert "Description:" in msg
def test_description_shown_for_container_restart():
data = {
"type": "container_restart", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "low",
"description": "Restart 'homeassistant' on chelsty-ha: HA WebSocket unresponsive",
}
msg = _format_pending_action("container-restart-chelsty-ha-homeassistant", data)
assert "HA WebSocket unresponsive" in msg
def test_description_absent_no_crash():
data = {"type": "redeploy", "service": "foo", "node": "bar", "risk_level": "guarded"}
msg = _format_pending_action("redeploy-bar-foo", data)
assert "Description:" not in msg
assert "Risk: *guarded*" in msg
def test_description_truncated_at_300_chars():
long_desc = "x" * 400
data = {
"type": "alert_only", "service": "homeassistant",
"node": "chelsty-ha", "risk_level": "info",
"description": long_desc,
}
msg = _format_pending_action("alert-ha-foo-chelsty-ha", data)
assert "x" * 300 in msg
assert "..." in msg
assert "x" * 301 not in msg
# ---------------------------------------------------------------------------
# Combined — real HA alert_only action shape
# ---------------------------------------------------------------------------
def test_ha_alert_only_full_action():
"""Mirrors an actual alert_only action written by supervisor._generate_ha_alert_only."""
data = {
"action_id": "alert-ha-entity-unavailable-chelsty-ha",
"type": "alert_only",
"node": "chelsty-ha",
"service": "homeassistant",
"risk_level": "info",
"confidence": 1.0,
"description": "3 entities unavailable for >1h: sensor.power, binary_sensor.window",
"status": "pending",
"payload": {
"location_tag": "chelsty",
"reason": "ha_entity_unavailable_long",
"count": 3,
},
}
msg = _format_pending_action(data["action_id"], data)
assert "alert_only" in msg
assert "chelsty-ha" in msg
assert "Risk: *info*" in msg
assert "3 entities unavailable" in msg
assert "unknown" not in msg
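
Review note: the tests above pin down three behaviours of `_format_pending_action`: `risk_level` takes precedence over the legacy `risk` key, the risk falls back to `unknown`, and descriptions are cut at 300 characters with a trailing `...`. The real function is not shown in this part of the diff, so the following is only a sketch of logic that would satisfy those assertions. The name `format_pending_action_sketch`, the constant, and the exact message layout are assumptions.

```python
MAX_DESCRIPTION_CHARS = 300  # assumed from test_description_truncated_at_300_chars


def format_pending_action_sketch(action_id: str, data: dict) -> str:
    """Build a Markdown summary for a pending action (illustrative only)."""
    # Prefer the current field, then the legacy one, then a visible fallback.
    risk = data.get("risk_level") or data.get("risk") or "unknown"
    lines = [
        "*Pending action*",
        f"ID: `{action_id}`",
        f"Type: `{data.get('type', 'n/a')}`",
        f"Target: {data.get('service', 'n/a')} on {data.get('node', 'n/a')}",
        f"Risk: *{risk}*",
    ]
    description = data.get("description")
    if description:
        if len(description) > MAX_DESCRIPTION_CHARS:
            description = description[:MAX_DESCRIPTION_CHARS] + "..."
        lines.append(f"Description: {description}")
    return "\n".join(lines)
```

Keeping the formatter a pure function of `action_id` and `data` is what lets the tests run without a Telegram connection.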

@ -0,0 +1,7 @@
FROM python:3.11-slim
WORKDIR /app
COPY web.py index.html ./
EXPOSE 8080
CMD ["python", "web.py"]

@ -0,0 +1,769 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Operator Control Plane</title>
<style>
:root {
--bg-color: #0a0c0e;
--sidebar-color: #14171a;
--card-color: #1c2024;
--border-color: #2a3540;
--text-color: #e7edf3;
--text-muted: #94a3b8;
--accent-color: #3eaf7c;
--nominal: #3eaf7c;
--degraded: #e7c000;
--unstable: #e67e22;
--reconciling: #3498db;
--error: #c0392b;
--safe: #3eaf7c;
--guarded: #e67e22;
--dangerous: #c0392b;
}
body {
margin: 0;
font-family: 'Inter', system-ui, -apple-system, sans-serif;
background: var(--bg-color);
color: var(--text-color);
display: flex;
height: 100vh;
overflow: hidden;
}
/* Sidebar */
.sidebar {
width: 240px;
background: var(--sidebar-color);
border-right: 1px solid var(--border-color);
display: flex;
flex-direction: column;
flex-shrink: 0;
}
.sidebar-header {
padding: 24px;
font-weight: 800;
font-size: 14px;
letter-spacing: 0.1em;
color: var(--accent-color);
border-bottom: 1px solid var(--border-color);
}
.nav-list {
list-style: none;
padding: 12px 0;
margin: 0;
flex-grow: 1;
}
.nav-item {
padding: 12px 24px;
cursor: pointer;
font-size: 14px;
color: var(--text-muted);
transition: all 0.2s;
display: flex;
align-items: center;
gap: 12px;
}
.nav-item:hover {
background: rgba(255, 255, 255, 0.05);
color: var(--text-color);
}
.nav-item.active {
background: rgba(62, 175, 124, 0.1);
color: var(--accent-color);
border-left: 3px solid var(--accent-color);
}
.sidebar-footer {
padding: 16px;
border-top: 1px solid var(--border-color);
font-size: 12px;
}
/* Content Area */
.main-content {
flex-grow: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
header {
height: 64px;
border-bottom: 1px solid var(--border-color);
display: flex;
align-items: center;
padding: 0 24px;
justify-content: space-between;
background: var(--bg-color);
}
.view-title {
font-size: 18px;
font-weight: 600;
}
.content-scroll {
flex-grow: 1;
overflow-y: auto;
padding: 24px;
}
/* Cards & Grids */
.grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
gap: 20px;
}
.card {
background: var(--card-color);
border: 1px solid var(--border-color);
padding: 20px;
border-radius: 4px;
position: relative;
}
.card-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 16px;
}
.card-title {
font-weight: 700;
font-size: 16px;
}
/* Status Badges */
.badge {
padding: 4px 8px;
border-radius: 4px;
font-size: 11px;
font-weight: 700;
text-transform: uppercase;
}
.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
/* Timeline */
.timeline {
display: flex;
flex-direction: column;
gap: 12px;
}
.event {
padding: 12px;
border-left: 2px solid var(--border-color);
background: rgba(255, 255, 255, 0.02);
font-family: ui-monospace, monospace;
font-size: 13px;
}
.event.high { border-left-color: var(--error); }
.event.medium { border-left-color: var(--unstable); }
.event.low { border-left-color: var(--nominal); }
.event-header {
display: flex;
justify-content: space-between;
margin-bottom: 4px;
color: var(--text-muted);
}
/* Forms & Inputs */
.controls {
display: flex;
gap: 12px;
margin-top: 20px;
}
input, button {
background: var(--card-color);
border: 1px solid var(--border-color);
color: var(--text-color);
padding: 8px 16px;
font-size: 14px;
border-radius: 4px;
}
button {
cursor: pointer;
font-weight: 600;
}
button:hover { background: var(--border-color); }
.btn-primary { background: var(--accent-color); color: white; border: none; }
.btn-primary:hover { background: #359b6d; }
/* Utility */
.hidden { display: none !important; }
.mono { font-family: ui-monospace, monospace; }
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
.value { font-weight: 500; margin-bottom: 12px; }
.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
</style>
</head>
<body>
<aside class="sidebar">
<div class="sidebar-header">HOMELAB OPERATOR</div>
<ul class="nav-list">
<li class="nav-item active" onclick="showView('dashboard', this)">
<span>Dashboard</span>
</li>
<li class="nav-item" onclick="showView('actions', this)">
<span>Action Queue</span>
</li>
<li class="nav-item" onclick="showView('nodes', this)">
<span>Nodes</span>
</li>
<li class="nav-item" onclick="showView('services', this)">
<span>Services</span>
</li>
<li class="nav-item" onclick="showView('deployments', this)">
<span>Deployments</span>
</li>
<li class="nav-item" onclick="showView('topology', this)">
<span>Topology</span>
</li>
<li class="nav-item" onclick="showView('events', this)">
<span>Events</span>
</li>
<li class="nav-item" onclick="showView('correlation', this)">
<span>Correlation</span>
</li>
<li class="nav-item" onclick="showView('recommendations', this)">
<span>Recommendations</span>
</li>
<li class="nav-item" onclick="showView('settings', this)">
<span>Settings</span>
</li>
</ul>
<div class="sidebar-footer">
<div id="summary-status">System Status: Loading...</div>
</div>
</aside>
<main class="main-content">
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
RUNTIME STATE IS STALE
</div>
<header>
<div style="display:flex; align-items:center; gap:20px">
<div class="view-title" id="current-view-title">Dashboard</div>
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
<option value="observe">OBSERVE</option>
<option value="recommend">RECOMMEND</option>
<option value="approval" selected>APPROVAL</option>
<option value="autonomous">AUTONOMOUS</option>
<option value="maintenance">MAINTENANCE</option>
</select>
</div>
<div class="header-actions" style="display:flex; gap:8px; align-items:center">
<button onclick="refreshData()">Refresh</button>
<button id="copy-ai-btn" onclick="copyForAI()">Copy for AI</button>
</div>
</header>
<div class="content-scroll">
<!-- Dashboard View -->
<div id="view-dashboard" class="view">
<div class="grid">
<div class="card">
<div class="card-title">System Overview</div>
<div id="dashboard-summary" style="margin-top:20px"></div>
</div>
<div class="card">
<div class="card-title">Pending Actions</div>
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
</div>
<div class="card">
<div class="card-title">Active Incidents</div>
<div id="dashboard-incidents" style="margin-top:20px"></div>
</div>
</div>
</div>
<!-- Actions View -->
<div id="view-actions" class="view hidden">
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
<div>
<h3>Pending Approval</h3>
<div id="actions-pending" class="timeline"></div>
</div>
<div>
<h3>Active / History</h3>
<div id="actions-history" class="timeline"></div>
</div>
</div>
</div>
<!-- Nodes View -->
<div id="view-nodes" class="view hidden">
<div class="grid" id="nodes-list"></div>
</div>
<!-- Services View -->
<div id="view-services" class="view hidden">
<div class="grid" id="services-list"></div>
</div>
<!-- Deployments View -->
<div id="view-deployments" class="view hidden">
<div class="grid" id="deployments-list"></div>
</div>
<!-- Topology View -->
<div id="view-topology" class="view hidden">
<div class="card" style="min-height:500px">
<div class="card-title">Runtime Topology</div>
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
</div>
</div>
<!-- Events View -->
<div id="view-events" class="view hidden">
<div class="timeline" id="events-timeline"></div>
</div>
<!-- Correlation View -->
<div id="view-correlation" class="view hidden">
<div id="correlation-chains" class="grid"></div>
</div>
<!-- Recommendations View -->
<div id="view-recommendations" class="view hidden">
<div class="grid" id="recommendations-list"></div>
</div>
<!-- Settings View -->
<div id="view-settings" class="view hidden">
<div class="card">
<div class="card-title">Configuration</div>
<div id="settings-content" style="margin-top:20px"></div>
</div>
</div>
</div>
</main>
<script>
let currentView = 'dashboard';
const pollInterval = 5000;
function showView(viewId, el) {
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
document.getElementById('view-' + viewId).classList.remove('hidden');
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
if (el) el.classList.add('active');
currentView = viewId;
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
refreshData();
}
async function fetchData(endpoint) {
try {
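const res = await fetch(endpoint, {cache: 'no-store'});
// Treat HTTP errors as failures so callers get null instead of an error body.
if (!res.ok) throw new Error(`HTTP ${res.status} for ${endpoint}`);
return await res.json();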
} catch (e) {
console.error('Fetch error:', endpoint, e);
return null;
}
}
async function postData(endpoint, data) {
try {
const res = await fetch(endpoint, {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(data)
});
return await res.json();
} catch (e) {
console.error('Post error:', endpoint, e);
return null;
}
}
async function mutateAction(id, status) {
const res = await postData('/action/mutate', {id, status});
if (res && res.status === 'ok') {
refreshData();
} else {
alert('Mutation failed');
}
}
async function setOperatorMode(mode) {
console.log('Operator mode set to:', mode);
const res = await postData('/mode', {mode});
if (res && res.status === 'ok') {
console.log('Mode updated successfully');
}
}
function formatTime(ts) {
if (!ts) return 'N/A';
return new Date(ts * 1000).toLocaleString();
}
function getStatusClass(status) {
status = (status || '').toLowerCase();
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
if (['unstable'].includes(status)) return 'status-unstable';
if (['reconciling'].includes(status)) return 'status-reconciling';
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
return '';
}
async function refreshData() {
// Refresh summary always
const summary = await fetchData('/summary');
if (summary) {
const statusEl = document.getElementById('summary-status');
statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
// The parent element already carries .sidebar-footer; only apply the status colour here.
statusEl.className = getStatusClass(summary.status);
// Handle stale state
const staleBanner = document.getElementById('stale-banner');
if (summary.stale) {
staleBanner.classList.remove('hidden');
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
} else {
staleBanner.classList.add('hidden');
}
if (currentView === 'dashboard') {
const dashSummary = document.getElementById('dashboard-summary');
dashSummary.innerHTML = `
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
<div class="label">Services</div><div class="value">${summary.service_count}</div>
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
`;
}
}
if (currentView === 'dashboard' || currentView === 'actions') {
const actions = await fetchData('/actions');
if (actions) {
if (currentView === 'dashboard') {
const dashActions = document.getElementById('dashboard-actions-summary');
const pendingCount = actions.pending.length;
dashActions.innerHTML = `
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
`;
}
if (currentView === 'actions') {
const pendingEl = document.getElementById('actions-pending');
const historyEl = document.getElementById('actions-history');
pendingEl.innerHTML = actions.pending.map(a => `
<div class="card" style="margin-bottom:12px">
<div class="card-header">
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
<span class="badge risk-${a.risk_level}">${a.risk_level}</span>
</div>
<p>${a.description || a.action_type || 'No description'}</p>
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
<div class="controls">
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
</div>
</div>
`).join('') || 'No pending actions.';
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
<div class="event">
<div class="event-header">
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
</div>
<div>${a.description || a.action_type || 'No description'}</div>
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
${a.transition_history ? `
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
</div>
` : ''}
</div>
`).join('') || 'No history.';
}
}
}
if (currentView === 'dashboard' || currentView === 'events') {
const incidents = await fetchData('/incidents');
if (currentView === 'dashboard') {
const dashIncidents = document.getElementById('dashboard-incidents');
if (!incidents || incidents.length === 0) {
dashIncidents.textContent = 'No active incidents.';
} else {
dashIncidents.innerHTML = incidents.map(inc => `
<div class="event ${inc.severity}">
<strong>${(inc.severity || 'info').toUpperCase()}:</strong> ${inc.message}<br>
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
</div>
`).join('');
}
}
}
if (currentView === 'nodes') {
const nodes = await fetchData('/nodes');
const list = document.getElementById('nodes-list');
list.innerHTML = (nodes || []).map(node => `
<div class="card">
<div class="card-header">
<div class="card-title">${node.hostname}</div>
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
</div>
<div class="label">ID</div><div class="value mono">${node.id}</div>
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
</div>
`).join('');
}
if (currentView === 'services') {
const services = await fetchData('/services');
const list = document.getElementById('services-list');
list.innerHTML = (services || []).map(svc => `
<div class="card">
<div class="card-header">
<div class="card-title">${svc.name}</div>
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
</div>
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
</div>
`).join('');
}
if (currentView === 'deployments') {
const deps = await fetchData('/deployments');
const list = document.getElementById('deployments-list');
list.innerHTML = (deps || []).map(dep => `
<div class="card">
<div class="card-header">
<div class="card-title">${dep.service}</div>
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
</div>
<div class="label">ID</div><div class="value mono">${dep.id}</div>
<div class="label">Stage</div><div class="value">${dep.stage}</div>
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
</div>
`).join('');
}
if (currentView === 'events') {
const events = await fetchData('/events');
const timeline = document.getElementById('events-timeline');
timeline.innerHTML = (events || []).map(ev => `
<div class="event ${ev.severity}">
<div class="event-header">
<span>${ev.type.toUpperCase()}</span>
<span>${formatTime(ev.timestamp)}</span>
</div>
<div>${ev.message}</div>
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
</div>
`).join('');
}
if (currentView === 'recommendations') {
const recs = await fetchData('/recommendations');
const list = document.getElementById('recommendations-list');
list.innerHTML = (recs || []).map(rec => `
<div class="card">
<div class="card-header">
<div class="card-title">${rec.title}</div>
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
</div>
<p>${rec.description}</p>
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
<div class="controls">
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
</div>
</div>
`).join('');
}
if (currentView === 'topology') {
const nodes = await fetchData('/nodes');
const services = await fetchData('/services');
const topMap = document.getElementById('topology-map');
if (nodes && services) {
topMap.innerHTML = nodes.map(node => {
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
return `
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
<div class="card-header">
<div class="card-title">${node.hostname}</div>
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
</div>
<div class="label">Capabilities</div>
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
<div class="label">Services</div>
<div style="font-size:12px; margin-bottom:10px">
${nodeServices.length > 0 ? nodeServices.map(s => `
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
<span>${s.name}</span>
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
</div>
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
`).join('') : '<div class="value">None</div>'}
</div>
</div>
`;
}).join('');
}
}
if (currentView === 'correlation') {
const incidents = await fetchData('/incidents');
const actions = await fetchData('/actions');
const list = document.getElementById('correlation-chains');
if (incidents && actions) {
const allActions = Object.values(actions).flat();
list.innerHTML = incidents.map(inc => {
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
return `
<div class="card">
<div class="card-header">
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
<span class="badge status-error">Active</span>
</div>
<p>${inc.message}</p>
<div class="label">Related Actions</div>
${related.map(a => `
<div class="event" style="margin-top:5px">
<strong>${a.type}</strong> (${a.status})<br>
<small>${a.description}</small>
</div>
`).join('') || '<div class="value">No actions yet</div>'}
</div>
`;
}).join('');
}
}
if (currentView === 'settings') {
const config = (await fetchData('/config')) || {};
const content = document.getElementById('settings-content');
content.innerHTML = `
<div class="label">Auto Mode</div>
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
<div class="label">Action Thresholds</div>
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
<div class="label">Telegram Integration</div>
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
`;
}
}
async function copyForAI() {
const btn = document.getElementById('copy-ai-btn');
const original = btn.textContent;
btn.textContent = 'Copying...';
btn.disabled = true;
try {
const snap = await fetchData('/snapshot');
if (!snap) throw new Error('snapshot fetch failed');
const now = new Date(snap.timestamp);
const dateStr = now.toISOString().slice(0, 16).replace('T', ' ');
const lines = [];
lines.push(`=== HOMELAB SNAPSHOT ${dateStr} ===`);
if (snap.nodes && snap.nodes.length > 0) {
lines.push('NODES: ' + snap.nodes.map(n =>
`${(n.hostname || n.id || '?').toUpperCase()} ${(n.health || 'unknown').toUpperCase()}`
).join(', '));
} else {
lines.push('NODES: none');
}
if (snap.non_nominal_services && snap.non_nominal_services.length > 0) {
lines.push('ERRORS: ' + snap.non_nominal_services.map(s =>
`${s.name} (${s.node}) - ${s.health}`
).join(', '));
} else {
lines.push(`ERRORS: none (${snap.nominal_service_count} nominal)`);
}
const activeIncidents = (snap.incidents || []).filter(i => !['resolved', 'closed'].includes(i.status));
if (activeIncidents.length > 0) {
lines.push('INCIDENTS: ' + activeIncidents.map(i =>
`[${i.severity}] ${i.message} (${i.node})`
).join('; '));
} else {
lines.push('INCIDENTS: none');
}
if (snap.events && snap.events.length > 0) {
lines.push(`EVENTS (last ${snap.events.length}):`);
snap.events.forEach(ev => {
const ts = ev.timestamp
? new Date(ev.timestamp * 1000).toISOString().slice(11, 19)
: '?';
const svc = ev.service ? '/' + ev.service : '';
lines.push(` ${ts} [${ev.severity || ev.level || '?'}] ${ev.type} - ${ev.message || ''} (${ev.node || ''}${svc})`);
});
} else {
lines.push('EVENTS (last 10): none');
}
const s = snap.summary || {};
lines.push(`SUMMARY: status=${s.status || '?'} nodes=${s.node_count ?? '?'} services=${s.service_count ?? '?'} incidents=${s.incident_count ?? '?'}`);
await navigator.clipboard.writeText(lines.join('\n'));
btn.textContent = 'Copied!';
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
} catch (e) {
console.error('copyForAI error:', e);
btn.textContent = 'Error';
setTimeout(() => { btn.textContent = original; btn.disabled = false; }, 2000);
}
}
// Initial load
refreshData();
// Poll for updates
setInterval(refreshData, pollInterval);
</script>
</body>
</html>

@ -0,0 +1,301 @@
import json
import os
import time
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
STATIC_DIR = Path(__file__).parent
DEFAULT_CONFIG = {
"operator_mode": "approval",
"auto_mode": True,
"action_thresholds": {
"restart_ha": 0.8,
"check_network": 0.9,
},
"default_threshold": 0.9,
"allowed_auto_actions": ["restart_ha"],
}
def read_json_file(path, default=None):
    if not path.exists():
        return default if default is not None else []
    try:
        return json.loads(path.read_text())
    except Exception:
        return default if default is not None else []


def get_config():
    config_path = STATE_DIR / "operator-config.json"
    # Pass a copy so callers that mutate the result (POST /config, POST /mode)
    # never modify the module-level DEFAULT_CONFIG.
    return read_json_file(config_path, dict(DEFAULT_CONFIG))
def save_config(config):
STATE_DIR.mkdir(parents=True, exist_ok=True)
(STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
def current_nodes():
return read_json_file(WORLD_DIR / "nodes.json")
def current_services():
return read_json_file(WORLD_DIR / "services.json")
def current_deployments():
return read_json_file(WORLD_DIR / "deployments.json")
def current_incidents():
return read_json_file(WORLD_DIR / "incidents.json")
def current_recommendations():
return read_json_file(WORLD_DIR / "recommendations.json")
def current_summary():
    path = WORLD_DIR / "runtime-summary.json"
    summary = read_json_file(path, default={})
    if summary:
        last_update_val = summary.get("last_update")
        if last_update_val:
            try:
                if isinstance(last_update_val, str):
                    last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
                else:
                    last_update = float(last_update_val)
            except Exception:
                last_update = os.path.getmtime(path)
        else:
            last_update = os.path.getmtime(path)
        summary["last_update"] = last_update
        summary["stale"] = (time.time() - last_update) > 60
    return summary
def current_events():
return read_json_file(WORLD_DIR / "events.json", default=[])
def current_actions():
actions = {}
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
for status in statuses:
actions[status] = []
status_dir = ACTIONS_DIR / status
if status_dir.exists():
for f in status_dir.glob("*.json"):
data = read_json_file(f)
if data:
# Injects some metadata for UI
data["id"] = data.get("action_id") or f.stem
data["status"] = status
actions[status].append(data)
return actions
def mutate_action(action_id, target_status):
    statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
    if target_status not in statuses:
        return False, f"Invalid target status: {target_status}"
    # action_id comes from the request body and is used to build a file path,
    # so reject anything containing a path separator (e.g. "../x").
    action_id = str(action_id)
    if "/" in action_id or "\\" in action_id:
        return False, f"Invalid action id: {action_id}"
    # Find where the action is
    source_path = None
    current_status = None
    for status in statuses:
        p = ACTIONS_DIR / status / f"{action_id}.json"
        if p.exists():
            source_path = p
            current_status = status
            break
    if not source_path:
        return False, f"Action {action_id} not found"
    target_dir = ACTIONS_DIR / target_status
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / f"{action_id}.json"
    try:
        data = json.loads(source_path.read_text())
        data["status"] = target_status
        data["updated_at"] = time.time()
        # Keep history of transitions
        history = data.get("transition_history", [])
        history.append({
            "from": current_status,
            "to": target_status,
            "timestamp": time.time()
        })
        data["transition_history"] = history
        target_path.write_text(json.dumps(data, indent=2))
        if source_path != target_path:
            source_path.unlink()
        return True, "Success"
    except Exception as e:
        return False, str(e)
def get_snapshot():
nodes = current_nodes()
services = current_services()
incidents = current_incidents()
events = current_events()
summary = current_summary()
non_nominal = [s for s in services if s.get("health") != "nominal"]
nominal_count = len(services) - len(non_nominal)
return {
"timestamp": datetime.now(timezone.utc).isoformat(),
"summary": summary,
"nodes": nodes,
"non_nominal_services": non_nominal,
"nominal_service_count": nominal_count,
"total_service_count": len(services),
"incidents": incidents,
"events": events[:10],
}
def send_json(status, payload, handler):
body = (json.dumps(payload) + "\n").encode("utf-8")
handler.send_response(status)
handler.send_header("Content-Type", "application/json")
handler.send_header("Content-Length", str(len(body)))
handler.end_headers()
handler.wfile.write(body)
class Handler(BaseHTTPRequestHandler):
def do_GET(self):
if self.path == "/config":
send_json(200, get_config(), self)
return
if self.path == "/nodes":
send_json(200, current_nodes(), self)
return
if self.path == "/services":
send_json(200, current_services(), self)
return
if self.path == "/deployments":
send_json(200, current_deployments(), self)
return
if self.path == "/incidents":
send_json(200, current_incidents(), self)
return
if self.path == "/recommendations":
send_json(200, current_recommendations(), self)
return
if self.path == "/summary":
send_json(200, current_summary(), self)
return
if self.path == "/events":
send_json(200, current_events(), self)
return
if self.path == "/actions":
send_json(200, current_actions(), self)
return
if self.path == "/snapshot":
send_json(200, get_snapshot(), self)
return
if self.path in ("/", "/index.html"):
body = (STATIC_DIR / "index.html").read_bytes()
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.send_header("Content-Length", str(len(body)))
self.end_headers()
self.wfile.write(body)
return
self.send_error(404)
    def do_POST(self):
        if self.path not in (
            "/config",
            "/action/mutate",
            "/mode",
        ):
            self.send_error(404)
            return
        try:
            length = int(self.headers.get("Content-Length", "0"))
            raw_body = self.rfile.read(length).decode("utf-8")
            payload = json.loads(raw_body)
        except ValueError:
            # Covers a bad Content-Length, invalid UTF-8, and invalid JSON.
            self.send_error(400, "Invalid JSON")
            return
        if not isinstance(payload, dict):
            # Every endpoint below calls dict methods on the payload.
            self.send_error(400, "JSON object expected")
            return
        if self.path == "/config":
            config = get_config()
            config.update(payload)
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return
        if self.path == "/mode":
            mode = payload.get("mode")
            if not mode:
                self.send_error(400, "mode is required")
                return
            config = get_config()
            config["operator_mode"] = mode
            save_config(config)
            send_json(200, {"status": "ok"}, self)
            return
        if self.path == "/action/mutate":
            action_id = payload.get("id")
            target = payload.get("status")
            if not action_id or not target:
                self.send_error(400, "id and status are required")
                return
            success, msg = mutate_action(action_id, target)
            if success:
                send_json(200, {"status": "ok"}, self)
            else:
                self.send_error(500, msg)
            return
def log_message(self, format, *args):
return
if __name__ == "__main__":
# Ensure directories exist
for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
d.mkdir(parents=True, exist_ok=True)
for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
port = int(os.getenv("PORT", "8080"))
print(f"Operator Control Plane starting on 0.0.0.0:{port}")
server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
server.serve_forever()
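
The staleness rule inside `current_summary()` can be pulled out into pure helpers, which makes it testable without touching the filesystem. The sketch below is an illustration only: `to_epoch` and `is_stale` are hypothetical names that do not exist in this diff, and the 60-second limit simply mirrors the value hard-coded above.

```python
from datetime import datetime


def to_epoch(value, fallback):
    """Convert an ISO-8601 string or a number to epoch seconds.

    Returns `fallback` when the value cannot be parsed, mirroring the
    file-mtime fallback used in current_summary().
    """
    try:
        if isinstance(value, str):
            # Older Python versions reject a trailing "Z" in fromisoformat().
            return datetime.fromisoformat(value.replace("Z", "+00:00")).timestamp()
        return float(value)
    except (TypeError, ValueError):
        return fallback


def is_stale(last_update, now, max_age=60.0):
    """True when the last update is more than max_age seconds old."""
    return (now - last_update) > max_age
```

With helpers like these, `current_summary()` would only need to supply `os.path.getmtime(path)` as the fallback and `time.time()` as `now`.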

@ -0,0 +1,10 @@
FROM python:3.11-slim
WORKDIR /app
COPY src/ src/
ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app/src
CMD ["python", "-m", "brain_watchdog.main"]

@ -0,0 +1,30 @@
services:
  brain-watchdog:
    build: .
    container_name: brain-watchdog
    restart: unless-stopped
    env_file:
      - /opt/homelab/config/brain-watchdog/.env
    volumes:
      - brain_watchdog_data:/data
    healthcheck:
      test:
        - "CMD"
        - "python"
        - "-c"
        - |
          import os, time, json, sys
          p = '/data/state.json'
          if not os.path.exists(p): sys.exit(1)
          age = time.time() - os.path.getmtime(p)
          sys.exit(0 if age < 300 else 1)
      interval: 1m
      timeout: 10s
      retries: 3
      start_period: 30s
volumes:
  brain_watchdog_data:

@ -0,0 +1,7 @@
CONTROL_PLANE_URL=
STALE_THRESHOLD=600
INTERVAL=60
FAILS_BEFORE_ALERT=3
TG_TOKEN=
TG_CHAT_ID=
HEALTHCHECKS_URL=

@ -0,0 +1,10 @@
#!/bin/sh
# Healthy if state.json was written within the last 5 minutes.
python -c "
import os, time, sys
p = '/data/state.json'
if not os.path.exists(p):
    sys.exit(1)
age = time.time() - os.path.getmtime(p)
sys.exit(0 if age < 300 else 1)
"

@ -0,0 +1,3 @@
[pytest]
pythonpath = src
testpaths = tests

@ -0,0 +1,34 @@
service:
  name: brain-watchdog
  owner_node: piha
  exposure: private
  description: >
    External watchdog for the control-plane on VPS. Queries /summary over
    Tailscale and alerts via Telegram Bot API directly, with no dependency on
    the control-plane itself. Freshness is computed locally from last_update epoch.
dependencies:
  - control-plane  # external (on VPS); deliberately untrusted for liveness
healthcheck:
  type: docker
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 30s
restart_policy: unless-stopped
persistence:
  paths:
    - /data  # state.json: fail_count, alerted, last_ok
runtime:
  env_vars:
    - CONTROL_PLANE_URL   # Tailscale IP + port of operator-ui (required)
    - STALE_THRESHOLD     # seconds before brain is considered stale (default: 600)
    - INTERVAL            # poll interval seconds (default: 60)
    - FAILS_BEFORE_ALERT  # consecutive failures before Telegram alert (default: 3)
    - TG_TOKEN            # Telegram Bot API token (required)
    - TG_CHAT_ID          # Telegram chat/user ID (required)
    - HEALTHCHECKS_URL    # optional healthchecks.io ping URL

@ -0,0 +1,157 @@
"""
brain-watchdog: external watchdog for the control-plane on VPS.
Runs on PIHA; queries /summary directly over Tailscale and alerts via
Telegram Bot API without going through the control-plane itself.
Never trusts the self-reported "status" field; freshness is computed
locally from last_update epoch vs. time.time().
"""
import json
import os
import time
import urllib.error
import urllib.request
from pathlib import Path
CONTROL_PLANE_URL = os.environ["CONTROL_PLANE_URL"].rstrip("/")
STALE_THRESHOLD = int(os.environ.get("STALE_THRESHOLD", "600"))
INTERVAL = int(os.environ.get("INTERVAL", "60"))
FAILS_BEFORE_ALERT = int(os.environ.get("FAILS_BEFORE_ALERT", "3"))
TG_TOKEN = os.environ["TG_TOKEN"]
TG_CHAT_ID = os.environ["TG_CHAT_ID"]
HEALTHCHECKS_URL = os.environ.get("HEALTHCHECKS_URL", "").strip()
STATE_FILE = Path("/data/state.json")
def load_state() -> dict:
if STATE_FILE.exists():
try:
return json.loads(STATE_FILE.read_text())
except Exception:
pass
return {"fail_count": 0, "alerted": False, "last_ok": 0.0}
def save_state(state: dict) -> None:
STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
STATE_FILE.write_text(json.dumps(state))
def http_get(url: str, timeout: int = 10) -> tuple[int | None, dict | None]:
try:
with urllib.request.urlopen(url, timeout=timeout) as resp:
return resp.status, json.loads(resp.read())
except urllib.error.HTTPError as exc:
return exc.code, None
except Exception:
return None, None
def send_telegram(message: str) -> bool:
url = f"https://api.telegram.org/bot{TG_TOKEN}/sendMessage"
payload = json.dumps(
{"chat_id": TG_CHAT_ID, "text": message, "parse_mode": "HTML"}
).encode()
req = urllib.request.Request(
url, data=payload, headers={"Content-Type": "application/json"}
)
try:
with urllib.request.urlopen(req, timeout=10) as resp:
return resp.status == 200
except Exception as exc:
print(f"[telegram] send failed: {exc}", flush=True)
return False
def ping_healthchecks() -> None:
if not HEALTHCHECKS_URL:
return
try:
urllib.request.urlopen(HEALTHCHECKS_URL, timeout=10)
except Exception as exc:
print(f"[healthchecks] ping failed: {exc}", flush=True)
def check() -> tuple[bool, str]:
"""Return (ok, human-readable reason). Never reads 'status' field."""
status, body = http_get(f"{CONTROL_PLANE_URL}/summary")
if status is None:
return False, "panel unreachable (connection error)"
if status != 200:
return False, f"panel returned HTTP {status}"
if not body:
return False, "panel returned empty / invalid JSON"
raw = body.get("last_update")
if raw is None:
return False, "summary missing last_update field"
try:
last_update_ts = float(raw)
except (TypeError, ValueError):
return False, f"last_update not parseable: {raw!r}"
age = time.time() - last_update_ts
if age > STALE_THRESHOLD:
return False, (
f"brain stale: last update {int(age // 60)}m ago "
f"(threshold {STALE_THRESHOLD // 60}m)"
)
return True, f"ok (age {int(age)}s)"
def main() -> None:
print(
f"[brain-watchdog] starting — "
f"url={CONTROL_PLANE_URL} "
f"stale_threshold={STALE_THRESHOLD}s "
f"interval={INTERVAL}s "
f"fails_before_alert={FAILS_BEFORE_ALERT}",
flush=True,
)
state = load_state()
    while True:
        ok, reason = check()
        ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        # Keep a space between the OK/FAIL tag and the reason.
        print(f"[{ts}] {'OK' if ok else 'FAIL'} {reason}", flush=True)
        if ok:
            if state.get("alerted"):
                recovered = send_telegram(
                    "✅ <b>brain-watchdog: control-plane RECOVERED</b>\n"
                    f"{reason}"
                )
                # Only log success when Telegram actually accepted the message.
                if recovered:
                    print("[telegram] sent recovery alert", flush=True)
            state["fail_count"] = 0
            state["alerted"] = False
            state["last_ok"] = time.time()
            save_state(state)
            ping_healthchecks()
        else:
            state["fail_count"] = state.get("fail_count", 0) + 1
            save_state(state)
            if state["fail_count"] >= FAILS_BEFORE_ALERT and not state.get("alerted"):
                sent = send_telegram(
                    "🚨 <b>brain-watchdog: control-plane DOWN</b>\n"
                    f"Reason: {reason}\n"
                    f"Consecutive failures: {state['fail_count']}\n"
                    f"URL: <code>{CONTROL_PLANE_URL}</code>"
                )
                if sent:
                    state["alerted"] = True
                    save_state(state)
                    print("[telegram] sent alert", flush=True)
        time.sleep(INTERVAL)
if __name__ == "__main__":
main()
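
The alerting rules in `main()` (count consecutive failures, alert once at the threshold, send one recovery message) form a small state machine. A minimal sketch of that logic as a pure function follows. `step` is a hypothetical helper, not part of this diff, and it simplifies two things: it assumes the notification always succeeds, and it leaves out `last_ok`.

```python
def step(state, ok, fails_before_alert=3):
    """Advance the watchdog state by one check result.

    Returns (new_state, event), where event is "alert", "recovery" or None.
    Assumption: the notification always succeeds, so `alerted` flips as
    soon as the event is emitted.
    """
    new_state = dict(state)
    event = None
    if ok:
        if state.get("alerted"):
            event = "recovery"
        new_state["fail_count"] = 0
        new_state["alerted"] = False
    else:
        new_state["fail_count"] = state.get("fail_count", 0) + 1
        if new_state["fail_count"] >= fails_before_alert and not state.get("alerted"):
            event = "alert"
            new_state["alerted"] = True
    return new_state, event
```

Keeping the decision separate from the I/O (HTTP check, Telegram call, `state.json` write) would let the threshold behaviour be tested with plain dictionaries.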

@ -0,0 +1,66 @@
"""
Tests for brain_watchdog.main.
Module-level env vars are required at import time; set them before the first
import of the module so tests can run without a real control-plane.
"""
import importlib.util
import os
import time
from unittest.mock import patch
os.environ.setdefault("CONTROL_PLANE_URL", "http://test-cp:8080")
os.environ.setdefault("TG_TOKEN", "test_token")
os.environ.setdefault("TG_CHAT_ID", "12345")
import brain_watchdog.main as bwm
def test_package_importable():
spec = importlib.util.find_spec("brain_watchdog")
assert spec is not None
def test_check_ok_fresh():
now = time.time()
with patch.object(bwm, "http_get", return_value=(200, {"last_update": now - 10})):
ok, reason = bwm.check()
assert ok
assert "ok" in reason
def test_check_fail_stale():
now = time.time()
stale_ts = now - (bwm.STALE_THRESHOLD + 120)
with patch.object(bwm, "http_get", return_value=(200, {"last_update": stale_ts})):
ok, reason = bwm.check()
assert not ok
assert "stale" in reason
def test_check_fail_unreachable():
with patch.object(bwm, "http_get", return_value=(None, None)):
ok, reason = bwm.check()
assert not ok
assert "unreachable" in reason
def test_check_fail_http_error():
with patch.object(bwm, "http_get", return_value=(503, None)):
ok, reason = bwm.check()
assert not ok
assert "503" in reason
def test_check_fail_missing_last_update():
with patch.object(bwm, "http_get", return_value=(200, {"other": "data"})):
ok, reason = bwm.check()
assert not ok
assert "last_update" in reason
def test_check_fail_unparseable_timestamp():
with patch.object(bwm, "http_get", return_value=(200, {"last_update": "not-a-number"})):
ok, reason = bwm.check()
assert not ok
assert "parseable" in reason
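
The tests above patch `http_get` itself, so its own error handling is never exercised. One way to cover it is to patch `urllib.request.urlopen` instead. The sketch below is self-contained: it uses a stand-in copy of `http_get` rather than importing `brain_watchdog.main`, and the test names are hypothetical.

```python
import io
import json
import urllib.error
import urllib.request
from unittest.mock import patch


def http_get(url, timeout=10):
    """Stand-in copy of brain_watchdog.main.http_get so the sketch runs alone."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read())
    except urllib.error.HTTPError as exc:
        return exc.code, None
    except Exception:
        return None, None


def test_http_get_connection_error():
    # A URLError (DNS failure, refused connection) maps to (None, None).
    with patch("urllib.request.urlopen", side_effect=urllib.error.URLError("down")):
        assert http_get("http://test-cp:8080/summary") == (None, None)


def test_http_get_http_error():
    # An HTTPError keeps its status code and drops the body.
    err = urllib.error.HTTPError(
        "http://test-cp:8080/summary", 503, "Service Unavailable", {}, io.BytesIO()
    )
    with patch("urllib.request.urlopen", side_effect=err):
        assert http_get("http://test-cp:8080/summary") == (503, None)
```

In the real test module the same patches would target `bwm.http_get` directly instead of the stand-in.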

@ -0,0 +1,24 @@
FROM python:3.11-slim
WORKDIR /app
RUN pip install --no-cache-dir pyyaml
# Create homelab user
RUN useradd -m -u 1000 homelab
# Copy sources
COPY src/ /app/src/
# The observer script and the rest of scripts/ are not copied into the image.
# The observer, supervisor and executor expect the repo to be mounted at /repo
# (see the volumes in docker-compose.yml).
ENV REPO_ROOT=/repo
ENV RUNTIME_PATH=/opt/homelab
ENV PYTHONUNBUFFERED=1
# Default command (will be overridden in docker-compose)
USER homelab
CMD ["python", "src/operator_ui.py"]

@ -0,0 +1,73 @@
#!/bin/bash
# services/control-plane/deploy-local.sh
set -e
# 1. Validate it is deploying control-plane
if [[ ! $(pwd) == *"/services/control-plane" ]]; then
echo "Error: Script must be run from services/control-plane directory"
exit 1
fi
if [[ ! -f "docker-compose.yml" ]]; then
echo "Error: docker-compose.yml not found"
exit 1
fi
echo "--- Preparing Control Plane Directories ---"
# 2. Prepare required dirs
# /opt/homelab/config
# /opt/homelab/actions/{pending,approved,rejected,running,completed,failed}
# /opt/homelab/world
# /opt/homelab/state
DIRS=(
"/opt/homelab/config"
"/opt/homelab/actions/pending"
"/opt/homelab/actions/approved"
"/opt/homelab/actions/rejected"
"/opt/homelab/actions/running"
"/opt/homelab/actions/completed"
"/opt/homelab/actions/failed"
"/opt/homelab/world"
"/opt/homelab/state"
)
for dir in "${DIRS[@]}"; do
if [ ! -d "$dir" ]; then
echo "Creating $dir"
sudo mkdir -p "$dir"
fi
done
# 3. chown/chmod for UID 1000 — self-healing: only calls sudo when actually needed
echo "Checking /opt/homelab ownership..."
_chown_needed=$(find /opt/homelab \( ! -uid 1000 -o ! -gid 1000 \) -print -quit 2>/dev/null)
if [[ -n "$_chown_needed" ]]; then
echo "Found files not owned by 1000:1000 (e.g. $_chown_needed) — fixing..."
sudo chown -R 1000:1000 /opt/homelab
else
echo "Ownership already correct, skipping chown"
fi
echo "Checking /opt/homelab directory permissions..."
_chmod_needed=$(find /opt/homelab -type d ! -perm -775 -print -quit 2>/dev/null)
if [[ -n "$_chmod_needed" ]]; then
echo "Found directories with wrong permissions (e.g. $_chmod_needed) — fixing..."
sudo chmod -R 775 /opt/homelab 2>/dev/null || true
else
echo "Permissions already correct, skipping chmod"
fi
# 4. Run docker compose up -d --build --force-recreate
echo "--- Starting Control Plane Services ---"
# Build the arguments as an array so paths containing spaces survive word splitting.
COMPOSE_ARGS=(-f docker-compose.yml)
OVERRIDE_FILE="../../hosts/vps/runtime/control-plane/docker-compose.override.yml"
if [ -f "$OVERRIDE_FILE" ]; then
  echo "Using override: $OVERRIDE_FILE"
  COMPOSE_ARGS+=(-f "$OVERRIDE_FILE")
fi
docker compose "${COMPOSE_ARGS[@]}" up -d --build --force-recreate
# 5. Print docker ps for control-plane containers
echo "--- Deployment Status ---"
docker ps --filter "name=control-plane"
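
Step 3 of the script only calls `sudo chown` when `find` reports at least one path with the wrong owner. The same check-before-fix idea can be sketched in Python, which may be useful if the check ever moves into one of the Python services. `first_mismatch` is a hypothetical helper, not something this diff contains.

```python
import os


def first_mismatch(root, uid, gid):
    """Return the first path under root not owned by uid:gid, or None.

    Roughly what the script's `find ... -print -quit` ownership probe does:
    stop at the first offending path instead of listing all of them.
    """
    def owned(path):
        st = os.lstat(path)  # lstat: inspect symlinks themselves, like find
        return st.st_uid == uid and st.st_gid == gid

    if not owned(root):
        return root
    # os.walk silently skips unreadable directories, similar to 2>/dev/null.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not owned(path):
                return path
    return None
```

A caller would run the privileged fix only when the result is not `None`, which is what keeps routine deploys free of `sudo` prompts.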

@ -0,0 +1,76 @@
services:
  operator-ui:
    build: .
    container_name: control-plane-ui
    user: "1000:1000"
    command: python src/operator_ui.py
    ports:
      - "18180:8080"
    volumes:
      - /opt/homelab:/opt/homelab
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8080/', timeout=3).read()"]
      interval: 30s
      timeout: 10s
      retries: 3
  observer:
    build: .
    container_name: control-plane-observer
    user: "1000:1000"
    command: python /repo/scripts/observer/observer.py
    volumes:
      - /opt/homelab:/opt/homelab
      - ../..:/repo:ro
    restart: unless-stopped
    environment:
      - REPO_ROOT=/repo
      - RUNTIME_PATH=/opt/homelab
    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/observer.heartbeat"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s
  supervisor:
    build: .
    container_name: control-plane-supervisor
    user: "1000:1000"
    command: python src/supervisor.py
    volumes:
      - /opt/homelab:/opt/homelab
      - ../..:/repo:ro
    restart: unless-stopped
    environment:
      - REPO_ROOT=/repo
      - RUNTIME_PATH=/opt/homelab
    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/supervisor.heartbeat"]
      interval: 60s
      timeout: 5s
      retries: 3
      start_period: 10s
  executor:
    build: .
    container_name: control-plane-executor
    user: "1000:1000"
    group_add:
      - "999"
    command: python src/executor.py
    volumes:
      - /opt/homelab:/opt/homelab
      - ../..:/repo
      - /var/run/docker.sock:/var/run/docker.sock
    restart: unless-stopped
    environment:
      - REPO_ROOT=/repo
      - RUNTIME_PATH=/opt/homelab
    healthcheck:
      test: ["CMD", "test", "-f", "/opt/homelab/state/executor.heartbeat"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 5s
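
The `test -f` healthchecks above only confirm that a heartbeat file exists, so a process that hangs after touching the file once would still pass. The brain-watchdog healthcheck compares the file's mtime with the current time instead. A minimal sketch of that freshness check follows, assuming a hypothetical `heartbeat_is_fresh` helper and an arbitrary 120-second limit. Neither appears in this diff.

```python
import os
import time


def heartbeat_is_fresh(path, max_age=120.0, now=None):
    """True if path exists and was modified within the last max_age seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat counts as unhealthy.
        return False
    return age < max_age
```

A healthcheck command could call such a helper and exit non-zero when it returns `False`. The limit would need to be comfortably larger than the interval at which each service touches its heartbeat.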

@ -0,0 +1,19 @@
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
[project]
name = "control-plane"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
"pyyaml>=6.0",
]
[project.optional-dependencies]
dev = [
"pytest>=8.1",
]
[tool.pytest.ini_options]
testpaths = ["tests"]

@ -0,0 +1,246 @@
import os
import json
import time
import logging
import subprocess
from pathlib import Path
def _atomic_write_json(path: Path, data) -> None:
    """Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
# SSH configuration
# SSH_USER can be overridden per-deployment environment.
SSH_USER = os.getenv("SSH_USER", "oskar")
SSH_OPTIONS = [
"-o", "StrictHostKeyChecking=no",
"-o", "ConnectTimeout=10",
"-o", "BatchMode=yes",
]
# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("executor")
class Executor:
def __init__(self):
self._ensure_dirs()
def _ensure_dirs(self):
for s in ["approved", "running", "completed", "failed", "rejected"]:
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
def process_actions(self):
# Update heartbeat
heartbeat_file = ACTIONS_DIR.parent / "state" / "executor.heartbeat"
try:
heartbeat_file.touch()
except Exception as e:
logger.error(f"Failed to touch heartbeat file: {e}")
approved_dir = ACTIONS_DIR / "approved"
action_files = sorted(approved_dir.glob("*.json"))
for action_file in action_files:
self._execute_action(action_file)
def _execute_action(self, action_file):
action_id = action_file.stem
logger.info(f"Executing action: {action_id}")
# Move to running
running_path = ACTIONS_DIR / "running" / f"{action_id}.json"
try:
with open(action_file, "r") as f:
data = json.load(f)
data["status"] = "running"
data["started_at"] = time.time()
_atomic_write_json(running_path, data)
action_file.unlink()
except Exception as e:
logger.error(f"Failed to move {action_id} to running: {e}")
return
# Dispatch by action type
success = False
error_msg = ""
try:
action_type = data.get("type")
node = data.get("node")
service = data.get("service")
if action_type == "redeploy":
# Full service redeploy via the repo deploy script
cmd = [
str(REPO_ROOT / "scripts" / "deploy" / "deploy-node.sh"),
node,
service
]
logger.info(f"Running command: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True, cwd=str(REPO_ROOT))
if result.returncode == 0:
success = True
else:
success = False
error_msg = result.stderr or result.stdout
elif action_type == "container_restart":
# Lightweight restart: SSH to node and docker restart the container.
# container_name is set by the supervisor; falls back to service name.
container_name = data.get("container_name") or service
success, error_msg = self._execute_container_restart(node, container_name)
elif action_type == "disk_cleanup":
# Operator-approved aggressive Docker cleanup (image prune -a +
# volume prune). Commands come from the action payload so the
# supervisor controls exactly what runs; the executor adds a
# safety check to reject anything touching protected paths.
payload = data.get("payload", {})
success, error_msg = self._execute_disk_cleanup(node, payload)
elif action_type == "alert_only":
# Operator acknowledged the alert; no automated execution needed.
success = True
else:
success = False
error_msg = f"Unknown action type: {action_type}"
except Exception as e:
success = False
error_msg = str(e)
# Move to completed/failed
target_status = "completed" if success else "failed"
target_path = ACTIONS_DIR / target_status / f"{action_id}.json"
try:
data["status"] = target_status
data["finished_at"] = time.time()
if not success:
data["error"] = error_msg
_atomic_write_json(target_path, data)
running_path.unlink()
logger.info(f"Action {action_id} {target_status}")
except Exception as e:
logger.error(f"Failed to move {action_id} to {target_status}: {e}")
def _execute_container_restart(self, node, container_name, retry_delay=10):
"""
SSH to the target node and run `docker restart <container_name>`.
Attempts the restart up to 2 times (initial + 1 retry). If the first
attempt fails, waits retry_delay seconds then tries once more before
declaring the action failed.
Returns (success: bool, error_msg: str).
"""
cmd = [
"ssh",
*SSH_OPTIONS,
f"{SSH_USER}@{node}",
f"docker restart {container_name}",
]
logger.info(f"SSH container restart: {' '.join(cmd)}")
max_attempts = 2
last_error = ""
for attempt in range(1, max_attempts + 1):
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
logger.info(
f"Container '{container_name}' on {node} restarted successfully "
f"(attempt {attempt}/{max_attempts})"
)
return True, ""
last_error = (result.stderr or result.stdout).strip()
logger.warning(
f"container_restart attempt {attempt}/{max_attempts} failed "
f"for '{container_name}' on {node}: {last_error}"
)
if attempt < max_attempts:
logger.info(f"Retrying in {retry_delay}s...")
time.sleep(retry_delay)
logger.error(
f"container_restart exhausted all {max_attempts} attempts "
f"for '{container_name}' on {node}"
)
return False, last_error
def _execute_disk_cleanup(self, node: str, payload: dict):
"""
SSH to the target node and run the operator-approved disk cleanup
commands from the action payload.
Safety invariants enforced here regardless of payload content:
- No command may reference /opt/homelab/data/, /opt/homelab/config/,
or /opt/homelab/state/ (application data and configuration).
- No command may contain rm -rf / or similar destructive patterns.
If any command fails the safety check the entire action is rejected
(not run at all) and the rejection reason is recorded.
Returns (success: bool, error_msg: str).
"""
commands = payload.get("commands", [
"docker image prune -a -f",
"docker volume prune -f",
])
# Safety gate: reject commands that touch protected paths
FORBIDDEN = [
"/opt/homelab/data",
"/opt/homelab/config",
"/opt/homelab/state",
"rm -rf /",
]
for cmd in commands:
for forbidden in FORBIDDEN:
if forbidden in cmd:
msg = f"Rejected: command contains forbidden pattern '{forbidden}': {cmd}"
logger.error(msg)
return False, msg
full_command = " && ".join(commands)
cmd = [
"ssh",
*SSH_OPTIONS,
f"{SSH_USER}@{node}",
full_command,
]
logger.info(f"Disk cleanup on {node}: {full_command}")
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
logger.info(f"Disk cleanup on {node} succeeded")
return True, ""
error_msg = (result.stderr or result.stdout).strip()
logger.error(f"Disk cleanup on {node} failed: {error_msg}")
return False, error_msg
def loop(self, interval=10):
logger.info("Starting executor loop")
while True:
self.process_actions()
time.sleep(interval)
if __name__ == "__main__":
executor = Executor()
executor.loop()
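
Two details of the executor are worth illustrating: the substring-based safety gate in `_execute_disk_cleanup`, and the fact that the last `ssh` argument is interpreted by a shell on the remote node. The sketch below is not the executor's code. `first_forbidden` and `restart_command` are hypothetical helpers, and quoting the container name with `shlex.quote` is a suggested hardening, not something this diff does.

```python
import shlex

FORBIDDEN = [
    "/opt/homelab/data",
    "/opt/homelab/config",
    "/opt/homelab/state",
    "rm -rf /",
]


def first_forbidden(commands, forbidden=FORBIDDEN):
    """Return (pattern, command) for the first forbidden match, or None."""
    for cmd in commands:
        for pattern in forbidden:
            if pattern in cmd:
                return pattern, cmd
    return None


def restart_command(ssh_user, node, container_name, ssh_options=()):
    """Build the ssh argv for `docker restart`, quoting the container name.

    The remote side runs the last argument through a shell, so an unquoted
    name such as "web; reboot" would run two commands there.
    """
    remote = "docker restart " + shlex.quote(container_name)
    return ["ssh", *ssh_options, f"{ssh_user}@{node}", remote]
```

Substring matching is easy to bypass, for example through a shell variable or a relative path, so a gate like this is better treated as a guard against mistakes than as a security boundary.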

@ -0,0 +1,701 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Operator Control Plane</title>
<style>
:root {
--bg-color: #0a0c0e;
--sidebar-color: #14171a;
--card-color: #1c2024;
--border-color: #2a3540;
--text-color: #e7edf3;
--text-muted: #94a3b8;
--accent-color: #3eaf7c;
--nominal: #3eaf7c;
--degraded: #e7c000;
--unstable: #e67e22;
--reconciling: #3498db;
--error: #c0392b;
--safe: #3eaf7c;
--guarded: #e67e22;
--dangerous: #c0392b;
}
body {
margin: 0;
font-family: 'Inter', system-ui, -apple-system, sans-serif;
background: var(--bg-color);
color: var(--text-color);
display: flex;
height: 100vh;
overflow: hidden;
}
/* Sidebar */
.sidebar {
width: 240px;
background: var(--sidebar-color);
border-right: 1px solid var(--border-color);
display: flex;
flex-direction: column;
flex-shrink: 0;
}
.sidebar-header {
padding: 24px;
font-weight: 800;
font-size: 14px;
letter-spacing: 0.1em;
color: var(--accent-color);
border-bottom: 1px solid var(--border-color);
}
.nav-list {
list-style: none;
padding: 12px 0;
margin: 0;
flex-grow: 1;
}
.nav-item {
padding: 12px 24px;
cursor: pointer;
font-size: 14px;
color: var(--text-muted);
transition: all 0.2s;
display: flex;
align-items: center;
gap: 12px;
}
.nav-item:hover {
background: rgba(255, 255, 255, 0.05);
color: var(--text-color);
}
.nav-item.active {
background: rgba(62, 175, 124, 0.1);
color: var(--accent-color);
border-left: 3px solid var(--accent-color);
}
.sidebar-footer {
padding: 16px;
border-top: 1px solid var(--border-color);
font-size: 12px;
}
/* Content Area */
.main-content {
flex-grow: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
header {
height: 64px;
border-bottom: 1px solid var(--border-color);
display: flex;
align-items: center;
padding: 0 24px;
justify-content: space-between;
background: var(--bg-color);
}
.view-title {
font-size: 18px;
font-weight: 600;
}
.content-scroll {
flex-grow: 1;
overflow-y: auto;
padding: 24px;
}
/* Cards & Grids */
.grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(350px, 1fr));
gap: 20px;
}
.card {
background: var(--card-color);
border: 1px solid var(--border-color);
padding: 20px;
border-radius: 4px;
position: relative;
}
.card-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 16px;
}
.card-title {
font-weight: 700;
font-size: 16px;
}
/* Status Badges */
.badge {
padding: 4px 8px;
border-radius: 4px;
font-size: 11px;
font-weight: 700;
text-transform: uppercase;
}
.status-nominal { background: rgba(62, 175, 124, 0.1); color: var(--nominal); }
.status-degraded { background: rgba(231, 192, 0, 0.1); color: var(--degraded); }
.status-unstable { background: rgba(230, 126, 34, 0.1); color: var(--unstable); }
.status-reconciling { background: rgba(52, 152, 219, 0.1); color: var(--reconciling); }
.status-error { background: rgba(192, 57, 43, 0.1); color: var(--error); }
/* Timeline */
.timeline {
display: flex;
flex-direction: column;
gap: 12px;
}
.event {
padding: 12px;
border-left: 2px solid var(--border-color);
background: rgba(255, 255, 255, 0.02);
font-family: ui-monospace, monospace;
font-size: 13px;
}
.event.high { border-left-color: var(--error); }
.event.medium { border-left-color: var(--unstable); }
.event.low { border-left-color: var(--nominal); }
.event-header {
display: flex;
justify-content: space-between;
margin-bottom: 4px;
color: var(--text-muted);
}
/* Forms & Inputs */
.controls {
display: flex;
gap: 12px;
margin-top: 20px;
}
input, button {
background: var(--card-color);
border: 1px solid var(--border-color);
color: var(--text-color);
padding: 8px 16px;
font-size: 14px;
border-radius: 4px;
}
button {
cursor: pointer;
font-weight: 600;
}
button:hover { background: var(--border-color); }
.btn-primary { background: var(--accent-color); color: white; border: none; }
.btn-primary:hover { background: #359b6d; }
/* Utility */
.hidden { display: none !important; }
.mono { font-family: ui-monospace, monospace; }
.label { color: var(--text-muted); font-size: 12px; margin-bottom: 4px; }
.value { font-weight: 500; margin-bottom: 12px; }
.risk-safe { background: rgba(62, 175, 124, 0.1); color: var(--safe); }
.risk-guarded { background: rgba(230, 126, 34, 0.1); color: var(--guarded); }
.risk-dangerous { background: rgba(192, 57, 43, 0.1); color: var(--dangerous); }
</style>
</head>
<body>
<aside class="sidebar">
<div class="sidebar-header">HOMELAB OPERATOR</div>
<ul class="nav-list">
<li class="nav-item active" onclick="showView('dashboard', this)">
<span>Dashboard</span>
</li>
<li class="nav-item" onclick="showView('actions', this)">
<span>Action Queue</span>
</li>
<li class="nav-item" onclick="showView('nodes', this)">
<span>Nodes</span>
</li>
<li class="nav-item" onclick="showView('services', this)">
<span>Services</span>
</li>
<li class="nav-item" onclick="showView('deployments', this)">
<span>Deployments</span>
</li>
<li class="nav-item" onclick="showView('topology', this)">
<span>Topology</span>
</li>
<li class="nav-item" onclick="showView('events', this)">
<span>Events</span>
</li>
<li class="nav-item" onclick="showView('correlation', this)">
<span>Correlation</span>
</li>
<li class="nav-item" onclick="showView('recommendations', this)">
<span>Recommendations</span>
</li>
<li class="nav-item" onclick="showView('settings', this)">
<span>Settings</span>
</li>
</ul>
<div class="sidebar-footer">
<div id="summary-status">System Status: Loading...</div>
</div>
</aside>
<main class="main-content">
<div id="stale-banner" class="hidden" style="background:var(--error); color:white; padding:8px 24px; font-weight:bold; font-size:12px; text-align:center; letter-spacing:0.05em">
RUNTIME STATE IS STALE
</div>
<header>
<div style="display:flex; align-items:center; gap:20px">
<div class="view-title" id="current-view-title">Dashboard</div>
<select id="operator-mode" onchange="setOperatorMode(this.value)" style="background:var(--sidebar-color); border:1px solid var(--border-color); color:var(--accent-color); font-weight:bold; font-size:12px; padding:4px 8px">
<option value="observe">OBSERVE</option>
<option value="recommend">RECOMMEND</option>
<option value="approval" selected>APPROVAL</option>
<option value="autonomous">AUTONOMOUS</option>
<option value="maintenance">MAINTENANCE</option>
</select>
</div>
<div class="header-actions">
<button onclick="refreshData()">Refresh</button>
</div>
</header>
<div class="content-scroll">
<!-- Dashboard View -->
<div id="view-dashboard" class="view">
<div class="grid">
<div class="card">
<div class="card-title">System Overview</div>
<div id="dashboard-summary" style="margin-top:20px"></div>
</div>
<div class="card">
<div class="card-title">Pending Actions</div>
<div id="dashboard-actions-summary" style="margin-top:20px"></div>
</div>
<div class="card">
<div class="card-title">Active Incidents</div>
<div id="dashboard-incidents" style="margin-top:20px"></div>
</div>
</div>
</div>
<!-- Actions View -->
<div id="view-actions" class="view hidden">
<div style="display:grid; grid-template-columns: 1fr 1fr; gap:24px">
<div>
<h3>Pending Approval</h3>
<div id="actions-pending" class="timeline"></div>
</div>
<div>
<h3>Active / History</h3>
<div id="actions-history" class="timeline"></div>
</div>
</div>
</div>
<!-- Nodes View -->
<div id="view-nodes" class="view hidden">
<div class="grid" id="nodes-list"></div>
</div>
<!-- Services View -->
<div id="view-services" class="view hidden">
<div class="grid" id="services-list"></div>
</div>
<!-- Deployments View -->
<div id="view-deployments" class="view hidden">
<div class="grid" id="deployments-list"></div>
</div>
<!-- Topology View -->
<div id="view-topology" class="view hidden">
<div class="card" style="min-height:500px">
<div class="card-title">Runtime Topology</div>
<div id="topology-map" style="margin-top:20px; display:flex; flex-wrap:wrap; gap:40px; justify-content:center"></div>
</div>
</div>
<!-- Events View -->
<div id="view-events" class="view hidden">
<div class="timeline" id="events-timeline"></div>
</div>
<!-- Correlation View -->
<div id="view-correlation" class="view hidden">
<div id="correlation-chains" class="grid"></div>
</div>
<!-- Recommendations View -->
<div id="view-recommendations" class="view hidden">
<div class="grid" id="recommendations-list"></div>
</div>
<!-- Settings View -->
<div id="view-settings" class="view hidden">
<div class="card">
<div class="card-title">Configuration</div>
<div id="settings-content" style="margin-top:20px"></div>
</div>
</div>
</div>
</main>
<script>
let currentView = 'dashboard';
const pollInterval = 5000;
function showView(viewId, el) {
document.querySelectorAll('.view').forEach(v => v.classList.add('hidden'));
document.getElementById('view-' + viewId).classList.remove('hidden');
document.querySelectorAll('.nav-item').forEach(i => i.classList.remove('active'));
if (el) el.classList.add('active');
currentView = viewId;
document.getElementById('current-view-title').textContent = viewId.charAt(0).toUpperCase() + viewId.slice(1);
refreshData();
}
async function fetchData(endpoint) {
try {
const res = await fetch(endpoint, {cache: 'no-store'});
return await res.json();
} catch (e) {
console.error('Fetch error:', endpoint, e);
return null;
}
}
async function postData(endpoint, data) {
try {
const res = await fetch(endpoint, {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify(data)
});
return await res.json();
} catch (e) {
console.error('Post error:', endpoint, e);
return null;
}
}
async function mutateAction(id, status) {
const res = await postData('/action/mutate', {id, status});
if (res && res.status === 'ok') {
refreshData();
} else {
alert('Mutation failed');
}
}
async function setOperatorMode(mode) {
console.log('Operator mode set to:', mode);
const res = await postData('/mode', {mode});
if (res && res.status === 'ok') {
console.log('Mode updated successfully');
}
}
function formatTime(ts) {
if (!ts) return 'N/A';
return new Date(ts * 1000).toLocaleString();
}
function getStatusClass(status) {
status = (status || '').toLowerCase();
if (['nominal', 'healthy', 'ok', 'up'].includes(status)) return 'status-nominal';
if (['degraded', 'warning'].includes(status)) return 'status-degraded';
if (['unstable'].includes(status)) return 'status-unstable';
if (['reconciling'].includes(status)) return 'status-reconciling';
if (['error', 'down', 'failed'].includes(status)) return 'status-error';
return '';
}
async function refreshData() {
// Refresh summary always
const summary = await fetchData('/summary');
if (summary) {
const statusEl = document.getElementById('summary-status');
statusEl.textContent = `System Status: ${(summary.status || 'unknown').toUpperCase()}`;
statusEl.className = 'sidebar-footer ' + getStatusClass(summary.status);
// Handle stale state
const staleBanner = document.getElementById('stale-banner');
if (summary.stale) {
staleBanner.classList.remove('hidden');
staleBanner.textContent = `CRITICAL: Runtime state is STALE (Last update: ${formatTime(summary.last_update)})`;
} else {
staleBanner.classList.add('hidden');
}
if (currentView === 'dashboard') {
const dashSummary = document.getElementById('dashboard-summary');
dashSummary.innerHTML = `
<div class="label">Nodes</div><div class="value">${summary.node_count}</div>
<div class="label">Services</div><div class="value">${summary.service_count}</div>
<div class="label">Last Update</div><div class="value">${formatTime(summary.last_update)}</div>
`;
}
}
if (currentView === 'dashboard' || currentView === 'actions') {
const actions = await fetchData('/actions');
if (actions) {
if (currentView === 'dashboard') {
const dashActions = document.getElementById('dashboard-actions-summary');
const pendingCount = actions.pending.length;
dashActions.innerHTML = `
<div class="label">Pending</div><div class="value" style="color:var(--guarded)">${pendingCount}</div>
<div class="label">Running</div><div class="value" style="color:var(--reconciling)">${actions.running.length}</div>
`;
}
if (currentView === 'actions') {
const pendingEl = document.getElementById('actions-pending');
const historyEl = document.getElementById('actions-history');
pendingEl.innerHTML = actions.pending.map(a => `
<div class="card" style="margin-bottom:12px">
<div class="card-header">
<div class="card-title">${(a.action_type || a.type || 'unknown').toUpperCase()}</div>
<span class="badge risk-${a.risk_level}">${a.risk_level}</span>
</div>
<p>${a.description || a.action_type || 'No description'}</p>
<div class="label">Target</div><div class="value">${a.node || (a.target && a.target.node) || 'unknown'} ${(a.service || (a.target && a.target.service)) || ''}</div>
<div class="label">Confidence</div><div class="value">${Math.round((a.confidence || 0)*100)}%</div>
<div class="controls">
<button class="btn-primary" onclick="mutateAction('${a.id}', 'approved')">Approve</button>
<button onclick="mutateAction('${a.id}', 'rejected')">Reject</button>
</div>
</div>
`).join('') || 'No pending actions.';
const history = [...actions.approved, ...actions.running, ...actions.completed, ...actions.failed, ...actions.rejected];
historyEl.innerHTML = history.sort((a,b) => (b.timestamp || b.updated_at || 0) - (a.timestamp || a.updated_at || 0)).map(a => `
<div class="event">
<div class="event-header">
<span>${(a.action_type || a.type || 'unknown').toUpperCase()}</span>
<span class="badge ${getStatusClass(a.status)}">${a.status}</span>
</div>
<div>${a.description || a.action_type || 'No description'}</div>
<small>${formatTime(a.timestamp || a.updated_at)} | Target: ${a.node || (a.target && a.target.node)}</small>
${a.status === 'approved' ? `<div class="controls"><button class="btn-primary" onclick="mutateAction('${a.id}', 'running')">Execute</button></div>` : ''}
${a.transition_history ? `
<div style="margin-top:8px; font-size:10px; color:var(--text-muted)">
<strong>Trace:</strong> ${a.transition_history.map(h => `${h.from}->${h.to}`).join(' → ')}
</div>
` : ''}
</div>
`).join('') || 'No history.';
}
}
}
if (currentView === 'dashboard' || currentView === 'events') {
const incidents = await fetchData('/incidents');
if (currentView === 'dashboard') {
const dashIncidents = document.getElementById('dashboard-incidents');
if (!incidents || incidents.length === 0) {
dashIncidents.textContent = 'No active incidents.';
} else {
dashIncidents.innerHTML = incidents.map(inc => `
<div class="event ${inc.severity}">
<strong>${inc.severity.toUpperCase()}:</strong> ${inc.message}<br>
<small>${formatTime(inc.timestamp)} | Node: ${inc.node}</small>
</div>
`).join('');
}
}
}
if (currentView === 'nodes') {
const nodes = await fetchData('/nodes');
const list = document.getElementById('nodes-list');
list.innerHTML = (nodes || []).map(node => `
<div class="card">
<div class="card-header">
<div class="card-title">${node.hostname}</div>
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
</div>
<div class="label">ID</div><div class="value mono">${node.id}</div>
<div class="label">Capabilities</div><div class="value">${node.capabilities.join(', ')}</div>
<div class="label">Connectivity</div><div class="value">${node.connectivity}</div>
<div class="label">Incidents (24h)</div><div class="value">${node.incidents}</div>
<div class="label">Last Seen</div><div class="value">${formatTime(node.last_seen)}</div>
<div class="label">Runtime Status</div><div class="value">${node.status}</div>
</div>
`).join('');
}
if (currentView === 'services') {
const services = await fetchData('/services');
const list = document.getElementById('services-list');
list.innerHTML = (services || []).map(svc => `
<div class="card">
<div class="card-header">
<div class="card-title">${svc.name}</div>
<span class="badge ${getStatusClass(svc.health)}">${svc.health}</span>
</div>
<div class="label">State (Desired/Actual)</div><div class="value">${svc.desired_state} / ${svc.actual_state}</div>
<div class="label">Deployment</div><div class="value">${svc.deployment_state}</div>
<div class="label">Dependencies</div><div class="value">${svc.dependencies.join(', ') || 'None'}</div>
<div class="label">Recommendations</div><div class="value">${svc.recommendations.join(', ') || 'None'}</div>
</div>
`).join('');
}
if (currentView === 'deployments') {
const deps = await fetchData('/deployments');
const list = document.getElementById('deployments-list');
list.innerHTML = (deps || []).map(dep => `
<div class="card">
<div class="card-header">
<div class="card-title">${dep.service}</div>
<span class="badge ${dep.status === 'failed' ? 'status-error' : 'status-reconciling'}">${dep.status}</span>
</div>
<div class="label">ID</div><div class="value mono">${dep.id}</div>
<div class="label">Stage</div><div class="value">${dep.stage}</div>
<div class="label">Diagnostics</div><div class="value">${dep.diagnostics || 'No data'}</div>
<div class="label">Resumable</div><div class="value">${dep.resumable ? 'Yes' : 'No'}</div>
${dep.resumable ? '<button class="btn-primary">Resume</button>' : ''}
</div>
`).join('');
}
if (currentView === 'events') {
const events = await fetchData('/events');
const timeline = document.getElementById('events-timeline');
timeline.innerHTML = (events || []).map(ev => `
<div class="event ${ev.severity}">
<div class="event-header">
<span>${ev.type.toUpperCase()}</span>
<span>${formatTime(ev.timestamp)}</span>
</div>
<div>${ev.message}</div>
<div class="label" style="margin-top:8px">Node: ${ev.node} ${ev.service ? '| Service: ' + ev.service : ''}</div>
</div>
`).join('');
}
if (currentView === 'recommendations') {
const recs = await fetchData('/recommendations');
const list = document.getElementById('recommendations-list');
list.innerHTML = (recs || []).map(rec => `
<div class="card">
<div class="card-header">
<div class="card-title">${rec.title}</div>
<span class="badge risk-${rec.risk_level}">${rec.risk_level}</span>
</div>
<p>${rec.description}</p>
<div class="label">Confidence</div><div class="value">${Math.round(rec.confidence * 100)}%</div>
<div class="label">Autonomous Eligible</div><div class="value">${rec.autonomous_eligible ? 'Yes' : 'No'}</div>
<div class="label">Blocked Actions</div><div class="value">${rec.blocked_actions.join(', ') || 'None'}</div>
<div class="controls">
<button class="btn-primary" ${rec.risk_level === 'dangerous' ? 'style="background:var(--dangerous)"' : ''}>Approve Action</button>
</div>
</div>
`).join('');
}
if (currentView === 'topology') {
const nodes = await fetchData('/nodes');
const services = await fetchData('/services');
const topMap = document.getElementById('topology-map');
if (nodes && services) {
topMap.innerHTML = nodes.map(node => {
const nodeServices = services.filter(s => s.node === node.hostname || s.node === node.id);
return `
<div class="card" style="width:250px; border: 1px solid ${node.health === 'nominal' ? 'var(--border-color)' : 'var(--error)'}">
<div class="card-header">
<div class="card-title">${node.hostname}</div>
<span class="badge ${getStatusClass(node.health)}">${node.health}</span>
</div>
<div class="label">Capabilities</div>
<div class="value" style="font-size:11px">${node.capabilities.join(', ')}</div>
<div class="label">Services</div>
<div style="font-size:12px; margin-bottom:10px">
${nodeServices.length > 0 ? nodeServices.map(s => `
<div style="display:flex; justify-content:space-between; margin-bottom:4px; padding:4px; background:rgba(255,255,255,0.03)">
<span>${s.name}</span>
<span class="${getStatusClass(s.health)}" style="font-size:10px">${s.health}</span>
</div>
${s.dependencies.length > 0 ? `<div style="font-size:9px; color:var(--text-muted); margin-left:8px; margin-bottom:4px">dep: ${s.dependencies.join(', ')}</div>` : ''}
`).join('') : '<div class="value">None</div>'}
</div>
</div>
`;
}).join('');
}
}
if (currentView === 'correlation') {
const incidents = await fetchData('/incidents');
const actions = await fetchData('/actions');
const list = document.getElementById('correlation-chains');
if (incidents && actions) {
const allActions = Object.values(actions).flat();
list.innerHTML = incidents.map(inc => {
const related = allActions.filter(a => a.correlation_chain && a.correlation_chain.includes(inc.id));
return `
<div class="card">
<div class="card-header">
<div class="card-title">Incident: ${inc.id || 'INC-001'}</div>
<span class="badge status-error">Active</span>
</div>
<p>${inc.message}</p>
<div class="label">Related Actions</div>
${related.map(a => `
<div class="event" style="margin-top:5px">
<strong>${a.action_type || a.type || 'unknown'}</strong> (${a.status})<br>
<small>${a.description}</small>
</div>
`).join('') || '<div class="value">No actions yet</div>'}
</div>
`;
}).join('');
}
}
if (currentView === 'settings') {
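// fetchData returns null on failure; fall back to an empty object so the template below does not throw.
const config = (await fetchData('/config')) || {};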
const content = document.getElementById('settings-content');
content.innerHTML = `
<div class="label">Auto Mode</div>
<div class="value">${config.auto_mode ? 'Enabled' : 'Disabled'}</div>
<div class="label">Action Thresholds</div>
<div class="value mono">${JSON.stringify(config.action_thresholds, null, 2)}</div>
<div class="label">Telegram Integration</div>
<div class="value" style="color:var(--text-muted)">Ready for mobile approval flows. Hook: /api/v1/telegram/webhook</div>
<button onclick="alert('Settings update not implemented in this demo')">Edit Configuration</button>
`;
}
}
// Initial load
refreshData();
// Poll for updates
setInterval(refreshData, pollInterval);
</script>
</body>
</html>

@ -0,0 +1,426 @@
import heapq
import json
import os
import re
import time
from datetime import datetime
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
STATE_DIR = Path(os.getenv("HOMELAB_STATE_ROOT", "/opt/homelab/state"))
EVENTS_DIR = Path(os.getenv("HOMELAB_EVENTS_ROOT", "/opt/homelab/events"))
WORLD_DIR = Path(os.getenv("HOMELAB_WORLD_ROOT", "/opt/homelab/world"))
ACTIONS_DIR = Path(os.getenv("HOMELAB_ACTIONS_ROOT", "/opt/homelab/actions"))
CONFIG_DIR = Path(os.getenv("HOMELAB_CONFIG_ROOT", "/opt/homelab/config"))
STATIC_DIR = Path(__file__).parent
_EVENT_TS_RE = re.compile(r"-(\d{9,11})-")
DEFAULT_CONFIG = {
"operator_mode": "approval",
"auto_mode": True,
"action_thresholds": {
"restart_ha": 0.8,
"check_network": 0.9,
},
"default_threshold": 0.9,
"allowed_auto_actions": ["restart_ha"],
}
def read_json_file(path, default=None):
if not path.exists():
return default if default is not None else []
try:
return json.loads(path.read_text())
except Exception:
return default if default is not None else []
def get_config():
config_path = STATE_DIR / "operator-config.json"
if config_path.exists():
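# Return copies so callers that mutate the result (e.g. config.update in
# do_POST) never modify the module-level DEFAULT_CONFIG.
return read_json_file(config_path, dict(DEFAULT_CONFIG))
return dict(DEFAULT_CONFIG)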
def save_config(config):
STATE_DIR.mkdir(parents=True, exist_ok=True)
(STATE_DIR / "operator-config.json").write_text(json.dumps(config, indent=2))
EVENTS_MAX_AGE_HOURS = int(os.getenv("EVENTS_MAX_AGE_HOURS", "24"))
EVENTS_MAX_COUNT = int(os.getenv("EVENTS_MAX_COUNT", "200"))
def _node_health(info):
status = info.get("status", "unknown")
if status == "offline":
return "error"
if info.get("disk_pressure") == "high":
return "degraded"
if status == "online":
return "nominal"
return status
def current_nodes():
"""Return nodes as a list of dicts shaped for the UI.
The observer stores nodes as a keyed dict {node_name: {...}}. The frontend
calls .map() which requires an array, so we convert here rather than change
the on-disk format (which the supervisor also reads).
"""
raw = read_json_file(WORLD_DIR / "nodes.json", default={})
if isinstance(raw, list):
return raw
result = []
for name, info in raw.items():
result.append({
"id": name,
"hostname": name,
"health": _node_health(info),
"status": info.get("status", "unknown"),
"capabilities": info.get("roles", []),
"connectivity": "tailscale",
"incidents": 0,
"last_seen": info.get("last_seen"),
"disk_usage_pct": info.get("disk_usage_pct"),
"mem_usage_pct": info.get("mem_usage_pct"),
"cpu_usage_pct": info.get("cpu_usage_pct"),
"disk_pressure": info.get("disk_pressure"),
})
return result
def current_services():
"""Return services as a list of dicts shaped for the UI.
Observer stores services as {"node/service": {...}}. Converted to a list
with the fields the services and topology views expect.
"""
raw = read_json_file(WORLD_DIR / "services.json", default={})
if isinstance(raw, list):
return raw
result = []
for key, info in raw.items():
svc_status = info.get("status", "unknown")
result.append({
"id": key,
"name": info.get("service", key),
"node": info.get("node", ""),
"health": ("nominal" if svc_status == "healthy"
else ("error" if svc_status == "unhealthy"
else svc_status)),
"desired_state": "running",
"actual_state": svc_status,
"deployment_state": "deployed",
"dependencies": [],
"recommendations": [],
"last_check": info.get("last_check"),
"incident_id": info.get("incident_id"),
})
return result
def current_deployments():
"""Return deployments as a list sorted newest-first."""
raw = read_json_file(WORLD_DIR / "deployments.json", default={})
if isinstance(raw, list):
return raw
result = []
for dep_id, info in raw.items():
result.append({
"id": dep_id,
"service": info.get("service", ""),
"node": info.get("node", ""),
"status": info.get("status", "unknown"),
"stage": info.get("status", "unknown"),
"diagnostics": info.get("last_error", ""),
"resumable": info.get("status") == "failed",
"started_at": info.get("started_at"),
"finished_at": info.get("finished_at"),
})
return sorted(result, key=lambda x: x.get("started_at") or 0, reverse=True)
def current_incidents():
"""Return active incidents as a list sorted most-recent-first.
Only incidents with status='active' are returned; resolved and cancelled
records are excluded so the dashboard reflects the current operational state.
"""
raw = read_json_file(WORLD_DIR / "incidents.json", default={})
if isinstance(raw, list):
return [i for i in raw if i.get("status") == "active"]
result = []
for inc in raw.values():
if inc.get("status") != "active":
continue
# Synthesise a human-readable message if not stored (observer doesn't set one).
if "message" not in inc:
inc = dict(inc)
inc["message"] = (
f"{inc.get('service', '?')} on {inc.get('node', '?')} "
f"is {inc.get('trigger_type', 'unhealthy')}"
)
result.append(inc)
return sorted(result, key=lambda x: x.get("last_occurrence") or 0, reverse=True)
def current_recommendations():
return read_json_file(WORLD_DIR / "recommendations.json")
def current_summary():
path = WORLD_DIR / "runtime-summary.json"
summary = read_json_file(path, default={})
if summary:
last_update_val = summary.get("last_update")
if last_update_val:
try:
if isinstance(last_update_val, str):
last_update = datetime.fromisoformat(last_update_val.replace('Z', '+00:00')).timestamp()
else:
last_update = float(last_update_val)
except Exception:
last_update = os.path.getmtime(path)
else:
last_update = os.path.getmtime(path)
summary["last_update"] = last_update
summary["stale"] = (time.time() - last_update) > 60
return summary
def _event_file_ts(p: Path) -> int:
"""Extract epoch timestamp from event filename: evt-<node>-<ts>-<type>-<svc>.json"""
m = _EVENT_TS_RE.search(p.stem)
return int(m.group(1)) if m else 0
def current_events():
"""Return the EVENTS_MAX_COUNT most-recent events, sorted newest-first.
Event files are named evt-<node>-<epoch>-<type>-<svc>.json. The directory
can contain hundreds of thousands of files (one file per event, written by
node-agent). Loading every file on each request causes catastrophic RSS
growth: 242 k files → 420 MB of Python objects + 100 MB JSON serialisation.
Fix: use heapq.nlargest to stream through file paths (O(N_files) time,
O(EVENTS_MAX_COUNT) memory), extracting the epoch from the filename without
opening any file. Only the winning EVENTS_MAX_COUNT files are then read.
"""
if not EVENTS_DIR.exists():
return []
cutoff = time.time() - EVENTS_MAX_AGE_HOURS * 3600
# Stream all paths through a bounded heap of size EVENTS_MAX_COUNT; the full
# list of paths is never materialised.
candidates = heapq.nlargest(
EVENTS_MAX_COUNT,
EVENTS_DIR.glob("**/*.json"),
key=_event_file_ts,
)
events = []
for f in candidates:
data = read_json_file(f)
if data and (data.get("timestamp") or 0) > cutoff:
data["_source"] = f.name
events.append(data)
return sorted(events, key=lambda x: x.get("timestamp") or 0, reverse=True)
def current_actions():
actions = {}
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
for status in statuses:
actions[status] = []
status_dir = ACTIONS_DIR / status
if status_dir.exists():
for f in status_dir.glob("*.json"):
data = read_json_file(f)
if data:
# Inject metadata the UI relies on (stable id and current status).
data["id"] = data.get("action_id") or f.stem
data["status"] = status
actions[status].append(data)
return actions
def mutate_action(action_id, target_status):
statuses = ["pending", "approved", "running", "completed", "failed", "rejected"]
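if target_status not in statuses:
return False, f"Invalid target status: {target_status}"
# Reject IDs containing path separators: action_id comes from the request
# body and is interpolated into a filesystem path below.
if Path(str(action_id)).name != str(action_id):
return False, f"Invalid action id: {action_id}"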
# Find where the action is
source_path = None
current_status = None
for status in statuses:
p = ACTIONS_DIR / status / f"{action_id}.json"
if p.exists():
source_path = p
current_status = status
break
if not source_path:
return False, f"Action {action_id} not found"
target_dir = ACTIONS_DIR / target_status
target_dir.mkdir(parents=True, exist_ok=True)
target_path = target_dir / f"{action_id}.json"
try:
data = json.loads(source_path.read_text())
data["status"] = target_status
data["updated_at"] = time.time()
# Keep history of transitions
history = data.get("transition_history", [])
history.append({
"from": current_status,
"to": target_status,
"timestamp": time.time()
})
data["transition_history"] = history
target_path.write_text(json.dumps(data, indent=2))
if source_path != target_path:
source_path.unlink()
return True, "Success"
except Exception as e:
return False, str(e)
def send_json(status, payload, handler):
body = (json.dumps(payload) + "\n").encode("utf-8")
handler.send_response(status)
handler.send_header("Content-Type", "application/json")
handler.send_header("Content-Length", str(len(body)))
handler.end_headers()
handler.wfile.write(body)
class Handler(BaseHTTPRequestHandler):
def do_GET(self):
if self.path == "/config":
send_json(200, get_config(), self)
return
if self.path == "/nodes":
send_json(200, current_nodes(), self)
return
if self.path == "/services":
send_json(200, current_services(), self)
return
if self.path == "/deployments":
send_json(200, current_deployments(), self)
return
if self.path == "/incidents":
send_json(200, current_incidents(), self)
return
if self.path == "/recommendations":
send_json(200, current_recommendations(), self)
return
if self.path == "/summary":
send_json(200, current_summary(), self)
return
if self.path == "/events":
send_json(200, current_events(), self)
return
if self.path == "/actions":
send_json(200, current_actions(), self)
return
if self.path in ("/", "/index.html"):
body = (STATIC_DIR / "index.html").read_bytes()
self.send_response(200)
self.send_header("Content-Type", "text/html; charset=utf-8")
self.send_header("Content-Length", str(len(body)))
self.end_headers()
self.wfile.write(body)
return
self.send_error(404)
def do_POST(self):
if self.path not in (
"/config",
"/action/mutate",
"/mode",
):
self.send_error(404)
return
length = int(self.headers.get("Content-Length", "0"))
raw_body = self.rfile.read(length).decode("utf-8")
try:
payload = json.loads(raw_body)
except json.JSONDecodeError:
self.send_error(400, "Invalid JSON")
return
if self.path == "/config":
config = get_config()
config.update(payload)
save_config(config)
send_json(200, {"status": "ok"}, self)
return
if self.path == "/mode":
mode = payload.get("mode")
if not mode:
self.send_error(400, "mode is required")
return
config = get_config()
config["operator_mode"] = mode
save_config(config)
send_json(200, {"status": "ok"}, self)
return
if self.path == "/action/mutate":
action_id = payload.get("id")
target = payload.get("status")
if not action_id or not target:
self.send_error(400, "id and status are required")
return
success, msg = mutate_action(action_id, target)
if success:
send_json(200, {"status": "ok"}, self)
else:
self.send_error(500, msg)
return
def log_message(self, format, *args):
return
class OperatorHTTPServer(ThreadingHTTPServer):
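# Use daemon threads so finished request threads do not accumulate in the
# internal _threads list. ThreadingMixIn only tracks non-daemon threads
# (for joining at server_close); with daemon_threads=True that list stays
# empty, preventing unbounded growth of dead Thread objects over time.
# ThreadingHTTPServer already sets daemon_threads = True; it is restated
# here explicitly so the behaviour is not lost if the base class changes.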
daemon_threads = True
if __name__ == "__main__":
# Ensure directories exist
for d in [STATE_DIR, EVENTS_DIR, WORLD_DIR, ACTIONS_DIR, CONFIG_DIR]:
d.mkdir(parents=True, exist_ok=True)
for s in ["pending", "approved", "running", "completed", "failed", "rejected"]:
(ACTIONS_DIR / s).mkdir(parents=True, exist_ok=True)
port = int(os.getenv("PORT", "8080"))
print(f"Operator Control Plane starting on 0.0.0.0:{port}")
server = OperatorHTTPServer(("0.0.0.0", port), Handler)
server.serve_forever()
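
The bounded selection described in the `current_events` docstring can be tried in isolation. The following is a minimal sketch, not the project's code: it assumes filenames follow the `evt-<node>-<epoch>-<type>-<svc>.json` pattern, and the helper names `filename_epoch`, `newest_names`, and `fake_names` are hypothetical, introduced only for illustration. It feeds a generator to `heapq.nlargest`, so only `limit` names are held at any time.

```python
import heapq
import re
from typing import Iterable, Iterator, List

# Assumed naming scheme: evt-<node>-<epoch>-<type>-<svc>.json
# The pattern mirrors _EVENT_TS_RE from the file above.
_TS_RE = re.compile(r"-(\d{9,11})-")


def filename_epoch(name: str) -> int:
    """Return the epoch embedded in an event filename, or 0 if none is found."""
    match = _TS_RE.search(name)
    return int(match.group(1)) if match else 0


def newest_names(names: Iterable[str], limit: int) -> List[str]:
    """Pick the `limit` newest names without building a list of all names.

    heapq.nlargest consumes the iterable once and keeps at most `limit`
    items in its internal heap, so memory stays proportional to `limit`.
    """
    return heapq.nlargest(limit, names, key=filename_epoch)


def fake_names(count: int) -> Iterator[str]:
    """Hypothetical stand-in for EVENTS_DIR.glob(...): yields names lazily."""
    for i in range(count):
        yield f"evt-node1-{1700000000 + i}-health-svc.json"


top = newest_names(fake_names(10_000), 3)
print(top)
```

One caveat of this approach: the regex takes the first hyphen-delimited run of 9 to 11 digits, so a node name that itself contains such a run would be mistaken for the timestamp. Files whose names do not match sort as epoch 0 and are only selected when fewer than `limit` matching files exist.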

@ -0,0 +1,771 @@
import os
import json
import time
import logging
import yaml
from pathlib import Path
def _atomic_write_json(path: Path, data) -> None:
"""Write JSON atomically: write to a sibling .tmp, fsync, then os.replace."""
tmp = path.with_suffix(".tmp")
with open(tmp, "w") as f:
json.dump(data, f, indent=2)
f.flush()
os.fsync(f.fileno())
os.replace(tmp, path)
# Constants and Paths
RUNTIME_PATH = os.getenv("RUNTIME_PATH", "/opt/homelab")
WORLD_DIR = Path(RUNTIME_PATH) / "world"
ACTIONS_DIR = Path(RUNTIME_PATH) / "actions"
EVENTS_DIR = Path(RUNTIME_PATH) / "events"
REPO_ROOT = Path(os.getenv("REPO_ROOT", "/repo"))
# Node alias map: maps alternative node names (as they appear in events/world state)
# to canonical topology node names (as they appear in hosts/*/services.yaml and topology.yaml).
# Override at runtime via NODE_ALIAS_MAP env var as a JSON string, e.g.:
# NODE_ALIAS_MAP='{"node-2": "chelsty", "node-1": "piha"}'
_NODE_ALIAS_ENV = os.getenv("NODE_ALIAS_MAP", "{}")
try:
NODE_ALIAS_MAP = json.loads(_NODE_ALIAS_ENV)
except Exception:
NODE_ALIAS_MAP = {}
# Event trigger types that should result in a lightweight container_restart
# rather than a full redeploy. The container is present but not running,
# or a dependency (MQTT) is unreachable — a restart is the right first step.
CONTAINER_RESTART_TRIGGERS = {"containers_not_running", "mqtt_unreachable"}
# Nodes where automatic disk_cleanup actions must NOT be generated.
# On chelsty nodes disk fullness is overwhelmingly caused by Frigate recordings
# or the HA database — Docker cleanup will not help and the operator must
# decide explicitly (e.g. adjust Frigate retain policy or purge HA recorder).
NO_DISK_CLEANUP_NODES = {"chelsty-infra", "chelsty-ha"}
# ---------------------------------------------------------------------------
# HA diagnostic event routing (ha-diag-agent events)
# ---------------------------------------------------------------------------
# ha_websocket_dead: HA WebSocket unresponsive → restart the homeassistant container.
# Separate from CONTAINER_RESTART_TRIGGERS because these events are routed directly
# from the events dir (not via the world-state drift loop) to avoid conflicts with
# the stability-agent's independent container health tracking on the same service key.
HA_CONTAINER_RESTART_EVENTS = {"ha_websocket_dead"}
# Alert-only events — operator notification, no automated action.
HA_ALERT_ONLY_EVENTS = {
"ha_integration_failed",
"ha_entity_unavailable_long",
"ha_automation_failing",
"ha_update_available",
"ha_recorder_lag",
"ha_system_health_degraded",
}
# Stable action-ID suffix for each alert-only type
_HA_ALERT_ID_SUFFIX = {
"ha_integration_failed": "integration-failed",
"ha_entity_unavailable_long": "entity-unavailable",
"ha_automation_failing": "automation-failing",
"ha_update_available": "update-available",
"ha_recorder_lag": "recorder-lag",
"ha_system_health_degraded": "system-health-degraded",
}
# 30-min cooldown after a container_restart completes; prevents restart loops
# when HA repeatedly fails to connect (e.g. bad config, slow startup).
HA_WEBSOCKET_RESTART_COOLDOWN = 1800
# 1-hour cooldown for alert-only events; avoids repeated Telegram noise for
# persistent conditions (e.g. an entity that stays unavailable for hours).
HA_ALERT_COOLDOWN = 3600
# Suppress ha_* events if homeassistant had a containers_not_running incident
# within this window — HA is in a planned restart/update and alerts would be noise.
HA_TRANSITION_WINDOW = 300 # 5 minutes
# When True, events that would generate container_restart are downgraded to alert_only
# with a "[SHADOW MODE]" note. Safe default for initial deployment; set
# HA_DIAG_SHADOW_MODE=false on the control-plane node when ready for live actions.
HA_DIAG_SHADOW_MODE = os.getenv("HA_DIAG_SHADOW_MODE", "true").lower() == "true"
# Logging setup
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("supervisor")
class Supervisor:
def __init__(self):
self.desired_state = {"services": {}}
self.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
# In-memory set of already-routed HA event IDs; prevents re-processing
# on each reconcile cycle. Grows to at most ~hundreds of entries/day.
self._ha_processed_event_ids: set = set()
self._ensure_dirs()
logger.info(
"shadow_mode=%s — HA container_restart actions %s",
HA_DIAG_SHADOW_MODE,
"downgraded to alert_only" if HA_DIAG_SHADOW_MODE else "enabled",
)
def _ensure_dirs(self):
ACTIONS_DIR.mkdir(parents=True, exist_ok=True)
(ACTIONS_DIR / "pending").mkdir(parents=True, exist_ok=True)
# ------------------------------------------------------------------
# Node name resolution
# ------------------------------------------------------------------
def _resolve_node(self, name):
"""Resolve an event/world-state node name to its canonical topology name."""
return NODE_ALIAS_MAP.get(name, name)
# ------------------------------------------------------------------
# Container name lookup
# ------------------------------------------------------------------
def _get_container_name(self, service):
"""
Determine the Docker container name for a service.
Parses container_name from the service's docker-compose.yml.
Falls back to the service name if not found.
"""
compose_path = REPO_ROOT / "services" / service / "docker-compose.yml"
if compose_path.exists():
try:
with open(compose_path, "r") as f:
compose = yaml.safe_load(f)
for svc_block in compose.get("services", {}).values():
cname = svc_block.get("container_name")
if cname:
return cname
except Exception as e:
logger.warning(f"Could not parse docker-compose for {service}: {e}")
# Convention: container name matches service name
return service
# ------------------------------------------------------------------
# State loading
# ------------------------------------------------------------------
def _load_desired_state(self):
services = {}
hosts_dir = REPO_ROOT / "hosts"
if not hosts_dir.exists():
logger.warning(f"Hosts directory {hosts_dir} does not exist")
return
for host_dir in hosts_dir.iterdir():
if host_dir.is_dir():
svc_file = host_dir / "services.yaml"
if svc_file.exists():
try:
with open(svc_file, "r") as f:
data = yaml.safe_load(f)
host_name = data.get("host")
for svc_name, svc_info in data.get("services", {}).items():
svc_info = svc_info or {}
# monitor: false — service is documented as desired but
# intentionally excluded from supervisor action generation.
# Use this when a service is not yet bootstrapped on an
# offline/LTE node so the queue stays clean until it is.
if svc_info.get("monitor") is False:
logger.debug(
f"Skipping {host_name}/{svc_name}: monitor=false"
)
continue
svc_key = f"{host_name}/{svc_name}"
services[svc_key] = {
"node": host_name,
"service": svc_name,
"desired": "running"
}
except Exception as e:
logger.error(f"Failed to load {svc_file}: {e}")
self.desired_state["services"] = services
def _load_actual_state(self) -> bool:
"""Load world state from disk. Returns False if any file is unreadable
(empty / mid-write truncation), in which case actual_state is NOT updated
so the caller can skip this reconcile cycle rather than treating missing
data as a real drift signal."""
files = {
"services": WORLD_DIR / "services.json",
"nodes": WORLD_DIR / "nodes.json",
"incidents": WORLD_DIR / "incidents.json"
}
raw = {}
for key, path in files.items():
if path.exists():
try:
with open(path, "r") as f:
raw[key] = json.load(f)
except Exception as e:
logger.warning(
f"World state {path.name} unreadable (truncated write?): {e} "
f"— skipping reconcile cycle, keeping last known state"
)
return False
else:
raw[key] = {}
# Normalize node names in services using alias map so that
# event-sourced names (e.g. "node-2") resolve to canonical
# topology names (e.g. "chelsty") before comparison with desired state.
normalized_services = {}
for svc_key, svc_info in raw.get("services", {}).items():
svc_info = dict(svc_info)
raw_node = svc_info.get("node", "")
canonical_node = self._resolve_node(raw_node)
if canonical_node != raw_node:
logger.debug(f"Resolved node alias: {raw_node} -> {canonical_node}")
svc_info["node"] = canonical_node
svc_name = svc_info.get("service") or svc_key.split("/", 1)[-1]
svc_key = f"{canonical_node}/{svc_name}"
normalized_services[svc_key] = svc_info
# Normalize node names in incidents as well
normalized_incidents = {}
for inc_id, inc in raw.get("incidents", {}).items():
inc = dict(inc)
raw_node = inc.get("node", "")
inc["node"] = self._resolve_node(raw_node)
normalized_incidents[inc_id] = inc
self.actual_state["services"] = normalized_services
self.actual_state["nodes"] = raw.get("nodes", {})
self.actual_state["incidents"] = normalized_incidents
return True
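A minimal sketch of the "skip the cycle on an unreadable file" behaviour described in the `_load_actual_state` docstring, written as a standalone function so it can be tried outside the supervisor. The name `load_world_state_sketch` and its `None`-on-failure return are illustrative assumptions, not part of the change itself; the real method returns a bool and also normalises node aliases.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_world_state_sketch(world_dir: Path) -> Optional[dict]:
    """Return the parsed world state, or None if any file cannot be parsed.

    A missing file counts as empty state; an empty or truncated file means the
    writer was probably caught mid-write, so the caller should keep its last
    known state instead of treating the gap as drift.
    """
    raw = {}
    for key in ("services", "nodes", "incidents"):
        path = world_dir / f"{key}.json"
        if not path.exists():
            raw[key] = {}
            continue
        try:
            raw[key] = json.loads(path.read_text())
        except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
            return None
    return raw


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        world = Path(d)
        (world / "services.json").write_text('{"vps/outline": {"status": "hea')
        state = load_world_state_sketch(world)
        print("skip cycle" if state is None else state)
```

Returning a sentinel rather than a partially filled dict keeps the decision in one place: the caller either reconciles against a complete snapshot or does nothing this cycle.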
# ------------------------------------------------------------------
# Incident helpers
# ------------------------------------------------------------------
def _get_incident_trigger(self, svc_key):
"""
Return the trigger_type of the active incident for a service, or None.
trigger_type is set by the observer when it creates an incident from
a specific event type (e.g. 'containers_not_running', 'mqtt_unreachable').
"""
svc_info = self.actual_state["services"].get(svc_key, {})
incident_id = svc_info.get("incident_id")
if not incident_id:
return None
incident = self.actual_state["incidents"].get(incident_id, {})
if incident.get("status") == "active":
return incident.get("trigger_type")
return None
# ------------------------------------------------------------------
# Reconciliation loop
# ------------------------------------------------------------------
def reconcile(self):
# Update heartbeat
heartbeat_file = WORLD_DIR.parent / "state" / "supervisor.heartbeat"
try:
heartbeat_file.touch()
except Exception as e:
logger.error(f"Failed to touch heartbeat file: {e}")
self._load_desired_state()
if not self._load_actual_state():
return # world state unreadable this cycle — skip to avoid false drift
drifts = []
# 1. Check for missing or unhealthy services
for svc_key, desired_info in self.desired_state["services"].items():
actual_info = self.actual_state["services"].get(svc_key)
if not actual_info:
drifts.append({
"type": "missing_service",
"svc_key": svc_key,
"node": desired_info["node"],
"service": desired_info["service"],
"trigger_type": None,
})
elif actual_info.get("status") != "healthy":
trigger_type = self._get_incident_trigger(svc_key)
drifts.append({
"type": "unhealthy_service",
"svc_key": svc_key,
"node": desired_info["node"],
"service": desired_info["service"],
"status": actual_info.get("status"),
"trigger_type": trigger_type,
})
# 2. Generate service-level recommendations
for drift in drifts:
self._generate_recommendation(drift)
# 3. Generate node-level recommendations (disk pressure)
for node_name, node_info in self.actual_state["nodes"].items():
if node_name in NO_DISK_CLEANUP_NODES:
continue
if node_info.get("disk_pressure") == "high":
self._generate_disk_cleanup_recommendation(node_name)
# 4. Cancel pending actions whose drift has been resolved.
# When a service becomes healthy again (because node-agent emits
# service_healthy and the observer updates services.json), any
# previously queued redeploy/container_restart action for that
# service is no longer needed. Move it to "cancelled/" so the
# operator can see it was auto-resolved rather than silently dropped.
self._cancel_resolved_pending_actions()
# 5. Route HA diagnostic events emitted by ha-diag-agent.
# Processed directly from the events directory — not via the world-state
# drift loop — to avoid conflicts with stability-agent's independent
# container health tracking for the homeassistant service.
self._process_ha_events()
# ------------------------------------------------------------------
# Recommendation generation
# ------------------------------------------------------------------
def _generate_recommendation(self, drift):
node = drift["node"]
service = drift["service"]
trigger_type = drift.get("trigger_type")
# Choose action type first so we can build the stable, deterministic ID.
# Stable IDs mean reconcile is truly idempotent: the same drift always
# produces the same filename, so we never create duplicates even across
# restarts of the supervisor.
if trigger_type in CONTAINER_RESTART_TRIGGERS:
action_id = f"container-restart-{node}-{service}"
else:
action_id = f"redeploy-{node}-{service}"
# Skip if an action for this ID is already live in any active state
# (pending → approved → running). This prevents re-creation after
# a human approves an action that hasn't executed yet.
for state in ("pending", "approved", "running"):
if (ACTIONS_DIR / state / f"{action_id}.json").exists():
logger.debug(f"Skipping {action_id}: already in state '{state}'")
return
if trigger_type in CONTAINER_RESTART_TRIGGERS:
# Lightweight remediation: the container exists but is not running
# (containers_not_running) or its MQTT dependency is unreachable
# (mqtt_unreachable). A docker restart is sufficient and low-risk.
container_name = self._get_container_name(service)
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "container_restart",
"node": node,
"service": service,
"container_name": container_name,
"risk_level": "low",
"confidence": 0.95,
"description": (
f"Restart container '{container_name}' on {node} "
f"(service: {service}, reason: {trigger_type})"
),
"status": "pending",
"payload": {
"reason": trigger_type,
"svc_key": drift["svc_key"],
},
}
else:
# Full redeploy: container is running but service is broken,
# or the cause is unknown / not a simple restart candidate.
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "redeploy",
"node": node,
"service": service,
"risk_level": "guarded",
"confidence": 0.9,
"description": f"Redeploy {service} on {node} due to {drift['type']}",
"status": "pending",
"payload": {
"reason": drift["type"],
"svc_key": drift["svc_key"],
},
}
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
try:
_atomic_write_json(action_path, action)
logger.info(
f"Generated recommendation: {action_id} "
f"(type={action['type']}, risk={action['risk_level']})"
)
except Exception as e:
logger.error(f"Failed to save recommendation {action_id}: {e}")
def _generate_disk_cleanup_recommendation(self, node: str):
"""
Generate a disk_cleanup action when node-agent reports critical disk
pressure (>85 %) on a node that supports automated Docker cleanup.
This is an OPERATOR-APPROVED action (risk=guarded): it runs
`docker image prune -a -f` and `docker volume prune -f`, which are
more aggressive than the safe auto-cleanup the node-agent runs itself.
Nodes in NO_DISK_CLEANUP_NODES never reach this method (filtered in
reconcile) because their disk fullness is caused by application data
(Frigate, HA) that the operator must handle manually.
"""
action_id = f"disk-cleanup-{node}"
for state in ("pending", "approved", "running"):
if (ACTIONS_DIR / state / f"{action_id}.json").exists():
logger.debug(f"Skipping {action_id}: already in state '{state}'")
return
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "disk_cleanup",
"node": node,
"service": "",
"risk_level": "guarded",
"confidence": 0.85,
"description": (
f"Aggressive disk cleanup on {node}: docker image prune -a "
f"and docker volume prune (requires operator approval)"
),
"status": "pending",
"payload": {
"reason": "disk_pressure",
"commands": [
"docker image prune -a -f",
"docker volume prune -f",
],
},
}
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
try:
_atomic_write_json(action_path, action)
logger.info(
f"Generated disk cleanup recommendation: {action_id} "
f"(node={node}, risk=guarded)"
)
except Exception as e:
logger.error(f"Failed to save disk cleanup recommendation {action_id}: {e}")
def _cancel_resolved_pending_actions(self):
"""
Auto-cancel pending service actions (redeploy / container_restart) whose
target service is now healthy in the actual state.
This keeps the action queue clean: when node-agent starts reporting
service_healthy for a container that previously had no world-state entry,
the pending 'missing_service' redeploy action that was generated before
the first health confirmation should be removed automatically rather than
sitting in the queue until an operator manually rejects it.
Only pending actions are considered; approved/running actions have already
been committed to by the operator and must not be cancelled automatically.
"""
cancelled_dir = ACTIONS_DIR / "cancelled"
cancelled_dir.mkdir(parents=True, exist_ok=True)
pending_dir = ACTIONS_DIR / "pending"
if not pending_dir.exists():
return
for action_file in list(pending_dir.glob("*.json")):
try:
with open(action_file, "r") as f:
action = json.load(f)
except Exception as e:
logger.error(f"Failed to read action {action_file.name}: {e}")
continue
action_type = action.get("type")
node = action.get("node")
service = action.get("service")
# Only auto-cancel service-level actions (not disk_cleanup)
if action_type not in ("redeploy", "container_restart"):
continue
if not node or not service:
continue
svc_key = f"{node}/{service}"
cancel_reason = None
# Case 1: service is no longer in desired state (removed from services.yaml
# or marked monitor:false). The action was generated under old config.
if svc_key not in self.desired_state["services"]:
cancel_reason = "service_removed_from_desired_state"
# Case 2: drift resolved — service is now healthy in actual state.
elif self.actual_state["services"].get(svc_key, {}).get("status") == "healthy":
cancel_reason = "drift_resolved_auto"
if cancel_reason:
dest = cancelled_dir / action_file.name
try:
action["status"] = "cancelled"
action["cancelled_reason"] = cancel_reason
action["cancelled_at"] = time.time()
_atomic_write_json(dest, action)
action_file.unlink()
logger.info(
f"Auto-cancelled {action_file.name}: "
f"{svc_key} ({cancel_reason})"
)
except Exception as e:
logger.error(f"Failed to cancel action {action_file.name}: {e}")
# ------------------------------------------------------------------
# HA diagnostic event routing
# ------------------------------------------------------------------
def _process_ha_events(self):
"""Scan the events directory for unprocessed ha_* events and route them."""
if not EVENTS_DIR.exists():
return
for event_file in sorted(EVENTS_DIR.glob("**/*.json")):
event_id = event_file.stem
if event_id in self._ha_processed_event_ids:
continue
try:
with open(event_file) as f:
event = json.load(f)
except Exception as e:
# Not marked as processed: a file caught mid-write is retried next cycle.
logger.debug(f"Could not read event {event_file}: {e}")
continue
self._ha_processed_event_ids.add(event_id)
if not event.get("type", "").startswith("ha_"):
continue
self._route_ha_event(event)
def _route_ha_event(self, event: dict):
event_type = event.get("type", "")
node = event.get("node", "")
if not node:
return
if event_type in HA_CONTAINER_RESTART_EVENTS:
if self._is_ha_in_transition(node):
logger.debug(
f"Suppressing {event_type} on {node}: homeassistant in transition"
)
return
if HA_DIAG_SHADOW_MODE:
logger.info(
"shadow_mode: suppressed container_restart for %s", event_type
)
self._generate_ha_shadow_alert(node, event)
else:
self._generate_ha_container_restart(node, event)
elif event_type == "ha_websocket_recovered":
self._cancel_ha_container_restart(node)
elif event_type in HA_ALERT_ONLY_EVENTS:
if self._is_ha_in_transition(node):
logger.debug(
f"Suppressing {event_type} on {node}: homeassistant in transition"
)
return
self._generate_ha_alert_only(node, event)
def _is_ha_in_transition(self, node: str) -> bool:
"""Return True if homeassistant container had a recent containers_not_running incident.
Suppresses ha_* alerts during planned HA restarts/updates to avoid
flooding the operator with secondary diagnostic alerts.
"""
svc_key = f"{node}/homeassistant"
svc_info = self.actual_state["services"].get(svc_key, {})
incident_id = svc_info.get("incident_id")
if not incident_id:
return False
incident = self.actual_state["incidents"].get(incident_id, {})
return (
incident.get("status") == "active"
and incident.get("trigger_type") == "containers_not_running"
and time.time() - (incident.get("last_occurrence") or 0) < HA_TRANSITION_WINDOW
)
def _ha_action_recently_completed(self, action_id: str, cooldown: int) -> bool:
"""Return True if action completed/rejected/cancelled within the cooldown window."""
for state in ("completed", "rejected", "cancelled"):
path = ACTIONS_DIR / state / f"{action_id}.json"
if path.exists():
try:
with open(path) as f:
data = json.load(f)
finished = (
data.get("finished_at")
or data.get("cancelled_at")
or data.get("updated_at")
or 0
)
if time.time() - finished < cooldown:
return True
except Exception:
pass
return False
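The cooldown check above can be exercised in isolation. The sketch below mirrors its logic as a free function; `recently_finished`, the explicit `actions_dir` argument, and the injectable `now` are hypothetical names added for illustration so the behaviour can be tested without the module-level `ACTIONS_DIR`.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


def recently_finished(
    actions_dir: Path, action_id: str, cooldown: int, now: Optional[float] = None
) -> bool:
    """Return True if a terminal record for action_id is younger than cooldown."""
    now = time.time() if now is None else now
    for state in ("completed", "rejected", "cancelled"):
        path = actions_dir / state / f"{action_id}.json"
        if not path.exists():
            continue
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable record: no cooldown information available
        finished = (
            data.get("finished_at")
            or data.get("cancelled_at")
            or data.get("updated_at")
            or 0
        )
        if now - finished < cooldown:
            return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        actions = Path(d)
        (actions / "cancelled").mkdir()
        record = actions / "cancelled" / "container-restart-node-homeassistant.json"
        record.write_text(json.dumps({"cancelled_at": 1000.0}))
        # 600 s after cancellation is still inside an 1800 s cooldown.
        print(recently_finished(actions, record.stem, 1800, now=1600.0))
```

Passing `now` explicitly is only a testing convenience; with the default it behaves like the method above and reads the wall clock.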
def _generate_ha_container_restart(self, node: str, event: dict):
service = "homeassistant"
action_id = f"container-restart-{node}-{service}"
for state in ("pending", "approved", "running"):
if (ACTIONS_DIR / state / f"{action_id}.json").exists():
logger.debug(f"Skipping {action_id}: already in state '{state}'")
return
if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
logger.debug(
f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
)
return
payload = dict(event.get("payload", {}))
payload["reason"] = "ha_websocket_dead"
payload["svc_key"] = f"{node}/{service}"
container_name = self._get_container_name(service)
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "container_restart",
"node": node,
"service": service,
"container_name": container_name,
"risk_level": "low",
"confidence": 0.9,
"description": (
f"Restart '{container_name}' on {node}: HA WebSocket unresponsive"
),
"status": "pending",
"payload": payload,
}
self._write_pending_action(action)
def _generate_ha_shadow_alert(self, node: str, event: dict):
"""Shadow-mode downgrade: emit alert_only instead of container_restart.
Uses the same action_id and cooldown as the real restart so that
cooldown semantics are identical regardless of shadow mode state.
"""
service = "homeassistant"
action_id = f"container-restart-{node}-{service}"
for state in ("pending", "approved", "running"):
if (ACTIONS_DIR / state / f"{action_id}.json").exists():
logger.debug(f"Skipping {action_id}: already in state '{state}'")
return
if self._ha_action_recently_completed(action_id, HA_WEBSOCKET_RESTART_COOLDOWN):
logger.debug(
f"Skipping {action_id}: within {HA_WEBSOCKET_RESTART_COOLDOWN}s cooldown"
)
return
payload = dict(event.get("payload", {}))
payload["reason"] = "ha_websocket_dead"
payload["svc_key"] = f"{node}/{service}"
payload["shadow_mode"] = True
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "alert_only",
"node": node,
"service": service,
"risk_level": "info",
"confidence": 0.9,
"description": (
f"[SHADOW MODE] would have triggered container_restart "
f"for {service} on {node}: HA WebSocket unresponsive"
),
"status": "pending",
"payload": payload,
}
self._write_pending_action(action)
def _generate_ha_alert_only(self, node: str, event: dict):
event_type = event.get("type", "")
suffix = _HA_ALERT_ID_SUFFIX.get(event_type, event_type.replace("_", "-"))
action_id = f"alert-ha-{suffix}-{node}"
for state in ("pending", "approved", "running"):
if (ACTIONS_DIR / state / f"{action_id}.json").exists():
logger.debug(f"Skipping {action_id}: already in state '{state}'")
return
if self._ha_action_recently_completed(action_id, HA_ALERT_COOLDOWN):
logger.debug(
f"Skipping {action_id}: within {HA_ALERT_COOLDOWN}s cooldown"
)
return
payload = dict(event.get("payload", {}))
payload["reason"] = event_type
action = {
"action_id": action_id,
"timestamp": time.time(),
"type": "alert_only",
"node": node,
"service": event.get("service", "homeassistant"),
"risk_level": "info",
"confidence": 1.0,
"description": event.get(
"message", f"HA diagnostic alert: {event_type} on {node}"
),
"status": "pending",
"payload": payload,
}
self._write_pending_action(action)
def _cancel_ha_container_restart(self, node: str):
"""Move a pending ha_websocket_dead container_restart to cancelled on recovery."""
action_id = f"container-restart-{node}-homeassistant"
pending_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
if not pending_path.exists():
return
cancelled_dir = ACTIONS_DIR / "cancelled"
cancelled_dir.mkdir(parents=True, exist_ok=True)
dest = cancelled_dir / f"{action_id}.json"
try:
with open(pending_path) as f:
action = json.load(f)
action["status"] = "cancelled"
action["cancelled_reason"] = "ha_websocket_recovered"
action["cancelled_at"] = time.time()
_atomic_write_json(dest, action)
pending_path.unlink()
logger.info(f"Cancelled {action_id}: ha_websocket_recovered on {node}")
except Exception as e:
logger.error(f"Failed to cancel {action_id}: {e}")
def _write_pending_action(self, action: dict):
action_id = action["action_id"]
action_path = ACTIONS_DIR / "pending" / f"{action_id}.json"
try:
_atomic_write_json(action_path, action)
logger.info(
f"Generated HA action: {action_id} "
f"(type={action['type']}, risk={action['risk_level']})"
)
except Exception as e:
logger.error(f"Failed to save action {action_id}: {e}")
def loop(self, interval=30):
logger.info("Starting supervisor loop")
while True:
self.reconcile()
time.sleep(interval)
if __name__ == "__main__":
supervisor = Supervisor()
supervisor.loop()
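`_atomic_write_json` is called throughout the supervisor but is defined outside this excerpt. Judging only from the tests in this change set (a `.tmp` sibling file that no longer exists afterwards, and a reference to `os.replace`), it presumably follows the usual write-then-rename pattern. The sketch below shows that pattern under those assumptions; the function name and the `fsync` call are additions for illustration and may not match the real helper.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json_sketch(path: Path, data: dict) -> None:
    """Write JSON to a sibling .tmp file, then swap it into place in one step."""
    tmp = path.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # assumption: flush to disk before the rename
    # os.replace overwrites the destination; the swap is atomic on POSIX when
    # both paths are on the same filesystem, so readers see old or new content,
    # never a half-written file.
    os.replace(tmp, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "redeploy-vps-outline.json"
        atomic_write_json_sketch(target, {"action_id": "redeploy-vps-outline"})
        print(json.loads(target.read_text()))
```

Keeping the temporary file next to the target matters: a rename across filesystems is not atomic, so a temp file under `/tmp` would lose the guarantee.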


@ -0,0 +1,333 @@
"""Tests for incident lifecycle: auto-resolve, orphan detection, timestamp parsing."""
from __future__ import annotations
import json
import sys
import time
from pathlib import Path
import pytest
# Observer lives outside the control-plane package; add scripts/ to path.
sys.path.insert(0, str(Path(__file__).parent.parent.parent.parent / "scripts"))
from observer.observer import Observer, _parse_ts, _atomic_write_json
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _make_observer(tmp_path: Path) -> Observer:
"""Return an Observer with all runtime paths redirected to tmp_path."""
import observer.observer as obs_mod
world = tmp_path / "world"
state = tmp_path / "state"
events = tmp_path / "events"
logs = tmp_path / "logs"
repo = tmp_path / "repo"
for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
d.mkdir(parents=True, exist_ok=True)
# Minimal topology so inventory isn't empty (avoids prune-guard early-return)
(repo / "inventory" / "topology.yaml").write_text(
"nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
)
original_world = obs_mod.WORLD_DIR
original_state = obs_mod.STATE_DIR
original_events = obs_mod.EVENTS_DIR
original_logs = obs_mod.LOGS_DIR
original_inventory = obs_mod.INVENTORY_TOPOLOGY
original_repo = obs_mod.REPO_ROOT
obs_mod.WORLD_DIR = world
obs_mod.STATE_DIR = state
obs_mod.EVENTS_DIR = events
obs_mod.LOGS_DIR = logs
obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
obs_mod.REPO_ROOT = repo
obs = Observer()
# Restore module-level constants (monkeypatching at module level is sufficient
# for the Observer instance which captures paths at construction time via globals)
obs_mod.WORLD_DIR = original_world
obs_mod.STATE_DIR = original_state
obs_mod.EVENTS_DIR = original_events
obs_mod.LOGS_DIR = original_logs
obs_mod.INVENTORY_TOPOLOGY = original_inventory
obs_mod.REPO_ROOT = original_repo
return obs
def _make_observer_simple(tmp_path: Path):
"""Return an Observer instance and patch its world_state in-place."""
import observer.observer as obs_mod
world = tmp_path / "world"
state = tmp_path / "state"
events = tmp_path / "events"
logs = tmp_path / "logs"
repo = tmp_path / "repo"
for d in (world, state, events, logs, repo / "inventory", repo / "hosts"):
d.mkdir(parents=True, exist_ok=True)
(repo / "inventory" / "topology.yaml").write_text(
"nodes:\n vps:\n roles: [control-plane]\n connectivity: {}\n"
)
# Patch before construction
obs_mod.WORLD_DIR = world
obs_mod.STATE_DIR = state
obs_mod.EVENTS_DIR = events
obs_mod.LOGS_DIR = logs
obs_mod.INVENTORY_TOPOLOGY = repo / "inventory" / "topology.yaml"
obs_mod.REPO_ROOT = repo
obs = Observer()
return obs
# ---------------------------------------------------------------------------
# 1. _parse_ts — timestamp normalisation
# ---------------------------------------------------------------------------
def test_parse_ts_int():
ts = int(time.time()) - 3600
assert abs(_parse_ts(ts) - ts) < 1
def test_parse_ts_float():
ts = time.time() - 100.5
assert abs(_parse_ts(ts) - ts) < 0.01
def test_parse_ts_iso_string():
# ISO format as emitted by events.py / stability-agent
from datetime import datetime, timezone
iso = "2026-06-01T00:03:22Z"
expected = datetime(2026, 6, 1, 0, 3, 22, tzinfo=timezone.utc).timestamp()
result = _parse_ts(iso)
assert result > 0
assert isinstance(result, float)
assert abs(result - expected) < 1
def test_parse_ts_none_returns_zero():
assert _parse_ts(None) == 0.0
def test_parse_ts_garbage_returns_zero():
assert _parse_ts("not-a-date") == 0.0
def test_parse_ts_zero_int():
assert _parse_ts(0) == 0.0
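For readers without the observer source at hand, here is a minimal sketch of a timestamp normaliser that would satisfy the six tests above. It is not the observer's actual `_parse_ts`; the name `parse_ts_sketch` and the choice to treat naive ISO strings as UTC are assumptions.

```python
from datetime import datetime, timezone
from typing import Any


def parse_ts_sketch(value: Any) -> float:
    """Normalise epoch seconds or an ISO-8601 string to a float epoch.

    Returns 0.0 for None, zero, or anything that cannot be parsed.
    """
    if value is None or isinstance(value, bool):
        return 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip()
        # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
        # so rewrite it as an explicit UTC offset for older interpreters.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return dt.timestamp()
    return 0.0


if __name__ == "__main__":
    for sample in (1700000000, "2026-06-01T00:03:22Z", "not-a-date", None):
        print(repr(sample), "->", parse_ts_sketch(sample))
```

Mapping every failure to `0.0` means an unparseable timestamp looks infinitely old to age comparisons, which matches what the prune tests further down expect of incidents carrying ISO-string fields.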
# ---------------------------------------------------------------------------
# 2. Lifecycle: service_healthy event resolves linked incident
# ---------------------------------------------------------------------------
def test_service_healthy_resolves_active_incident(tmp_path):
obs = _make_observer_simple(tmp_path)
inc_id = "inc-111-vps-outline"
obs.world_state["services"]["vps/outline"] = {
"node": "vps", "service": "outline",
"status": "unhealthy", "last_check": None,
"incident_id": inc_id,
}
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "node": "vps", "service": "outline",
"status": "active", "trigger_type": "service_unhealthy",
"started_at": int(time.time()) - 600,
"last_occurrence": int(time.time()) - 600,
"occurrence_count": 1, "events": [],
}
obs.process_event({
"type": "service_healthy",
"node": "vps",
"service": "outline",
"severity": "info",
"timestamp": int(time.time()),
"payload": {},
})
assert obs.world_state["services"]["vps/outline"]["status"] == "healthy"
assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
def test_service_healthy_does_not_resolve_other_incidents(tmp_path):
"""service_healthy for service A must not touch incident for service B."""
obs = _make_observer_simple(tmp_path)
inc_b = "inc-222-vps-supervisor"
obs.world_state["services"]["vps/supervisor"] = {
"node": "vps", "service": "supervisor",
"status": "unhealthy", "last_check": None,
"incident_id": inc_b,
}
obs.world_state["incidents"][inc_b] = {
"id": inc_b, "status": "active",
"last_occurrence": int(time.time()) - 300,
}
obs.process_event({
"type": "service_healthy",
"node": "vps",
"service": "outline", # different service
"severity": "info",
"timestamp": int(time.time()),
"payload": {},
})
assert obs.world_state["incidents"][inc_b]["status"] == "active"
# ---------------------------------------------------------------------------
# 3. _prune_stale_world: healthy-service-linked incident → immediate resolve
# ---------------------------------------------------------------------------
def test_prune_resolves_healthy_linked_incident(tmp_path):
"""If a service is healthy but still points at an active incident, resolve it."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-333-vps-outline"
obs.world_state["services"]["vps/outline"] = {
"node": "vps", "service": "outline",
"status": "healthy", # <-- healthy but incident_id still set
"last_check": None,
"incident_id": inc_id,
}
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "active",
"started_at": int(time.time()) - 7200,
"last_occurrence": int(time.time()) - 7200,
}
obs._prune_stale_world()
assert obs.world_state["services"]["vps/outline"]["incident_id"] is None
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
def test_prune_resolves_healthy_linked_incident_iso_timestamp(tmp_path):
"""Healthy-linked incident with ISO-string last_occurrence must still resolve."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-444-vps-outline"
obs.world_state["services"]["vps/outline"] = {
"node": "vps", "service": "outline",
"status": "healthy", "last_check": None, "incident_id": inc_id,
}
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "active",
"last_occurrence": "2026-06-01T00:03:22Z", # ISO string from events.py
}
obs._prune_stale_world() # must not raise TypeError
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
# ---------------------------------------------------------------------------
# 4. _prune_stale_world: orphaned incident (no service link) → resolve after 5 min
# ---------------------------------------------------------------------------
def test_prune_resolves_orphaned_incident_old_enough(tmp_path):
"""Orphaned active incident older than 5 min must be auto-resolved."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-555-vps-supervisor"
# No service entry links to this incident
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "active", "node": "vps", "service": "supervisor",
"last_occurrence": int(time.time()) - 400, # 6.7 min ago
}
obs._prune_stale_world()
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
def test_prune_does_not_resolve_orphaned_incident_too_recent(tmp_path):
"""Orphaned incident younger than 5 min must stay active (guard against race)."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-666-vps-supervisor"
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "active",
"last_occurrence": int(time.time()) - 60, # 1 min ago — within guard
}
obs._prune_stale_world()
assert obs.world_state["incidents"][inc_id]["status"] == "active"
def test_prune_resolves_orphaned_incident_iso_timestamp(tmp_path):
"""Orphaned incident with ISO-string last_occurrence must resolve correctly."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-777-vps-outline"
# ISO timestamp well in the past (2026-06-01)
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "active",
"last_occurrence": "2026-06-01T00:03:22Z",
}
obs._prune_stale_world() # must not raise TypeError
assert obs.world_state["incidents"][inc_id]["status"] == "resolved"
def test_prune_does_not_touch_linked_incident(tmp_path):
"""An active incident still linked from a non-healthy service must stay active."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-888-vps-outline"
obs.world_state["services"]["vps/outline"] = {
"node": "vps", "service": "outline",
"status": "unhealthy", # <-- still unhealthy
"last_check": None,
"incident_id": inc_id,
}
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "active",
"last_occurrence": int(time.time()) - 3600,
}
obs._prune_stale_world()
assert obs.world_state["incidents"][inc_id]["status"] == "active"
# ---------------------------------------------------------------------------
# 5. 7-day stale incident prune with ISO resolved_at
# ---------------------------------------------------------------------------
def test_prune_removes_old_resolved_incident_iso_resolved_at(tmp_path):
"""Resolved incidents with ISO-string resolved_at older than 7 days must be pruned."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-old-resolved"
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "resolved",
"resolved_at": "2026-05-01T00:00:00Z", # >7 days before 2026-06-03
}
obs._prune_stale_world()
assert inc_id not in obs.world_state["incidents"]
def test_prune_keeps_recently_resolved_incident(tmp_path):
"""Resolved incidents within 7 days must be kept."""
obs = _make_observer_simple(tmp_path)
inc_id = "inc-recent-resolved"
obs.world_state["incidents"][inc_id] = {
"id": inc_id, "status": "resolved",
"resolved_at": time.time() - 86400, # 1 day ago
}
obs._prune_stale_world()
assert inc_id in obs.world_state["incidents"]

@ -0,0 +1,199 @@
"""Tests for atomic writes and resilient world-state loading in the supervisor."""
from __future__ import annotations
import json
import sys
import time
from pathlib import Path
import pytest
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
import supervisor as supervisor_module
from supervisor import Supervisor, _atomic_write_json
# ---------------------------------------------------------------------------
# Helpers (reused from test_supervisor_ha)
# ---------------------------------------------------------------------------

def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
    actions = tmp_path / "actions"
    events = tmp_path / "events"
    world = tmp_path / "world"
    repo = tmp_path / "repo"
    for d in (actions, events, world, repo / "hosts"):
        d.mkdir(parents=True, exist_ok=True)
    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)
    sup = Supervisor()
    sup.desired_state = {"services": {}}
    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
    return sup


# ---------------------------------------------------------------------------
# 1. atomic_write_json correctness
# ---------------------------------------------------------------------------

def test_atomic_write_json_produces_valid_json(tmp_path):
    path = tmp_path / "out.json"
    data = {"services": {"vps/outline": {"status": "healthy"}}, "count": 42}
    _atomic_write_json(path, data)
    assert path.exists(), "output file must exist after atomic write"
    loaded = json.loads(path.read_text())
    assert loaded == data


def test_atomic_write_json_no_tmp_left_behind(tmp_path):
    path = tmp_path / "world.json"
    _atomic_write_json(path, {"ok": True})
    tmp = path.with_suffix(".tmp")
    assert not tmp.exists(), ".tmp must be cleaned up by os.replace"


def test_atomic_write_json_overwrites_existing(tmp_path):
    path = tmp_path / "state.json"
    path.write_text('{"old": true}')
    _atomic_write_json(path, {"new": True})
    assert json.loads(path.read_text()) == {"new": True}


def test_atomic_write_json_nested_structure(tmp_path):
    path = tmp_path / "complex.json"
    data = {
        "nodes": {"vps": {"status": "online", "disk_usage_pct": 42}},
        "incidents": {},
        "list": [1, 2, 3],
    }
    _atomic_write_json(path, data)
    assert json.loads(path.read_text()) == data


# ---------------------------------------------------------------------------
# 2. Resilient loader: empty / truncated file → skip cycle, no drift
# ---------------------------------------------------------------------------

def _populate_desired(sup: Supervisor, svc_key: str = "vps/outline"):
    node, service = svc_key.split("/", 1)
    sup.desired_state["services"][svc_key] = {
        "node": node,
        "service": service,
        "desired": "running",
    }


def test_empty_services_json_skips_reconcile(tmp_path, monkeypatch):
    """Empty services.json (truncated write) must not generate any redeploy action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)
    # Write empty services.json to simulate a mid-write truncation
    (tmp_path / "world" / "services.json").write_text("")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    sup.reconcile()
    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions should be generated on empty state file, got: {[p.name for p in pending]}"


def test_truncated_services_json_skips_reconcile(tmp_path, monkeypatch):
    """Partially-written (truncated mid-write) JSON must not generate any action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)
    (tmp_path / "world" / "services.json").write_text('{"vps/outline": {"status": "hea')
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    sup.reconcile()
    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions expected on truncated state, got: {[p.name for p in pending]}"


def test_empty_incidents_json_skips_reconcile(tmp_path, monkeypatch):
    """Empty incidents.json (any world-state file failing) skips full cycle."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)
    (tmp_path / "world" / "services.json").write_text("{}")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("")
    sup.reconcile()
    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], f"No actions expected when any state file is unreadable, got: {[p.name for p in pending]}"


def test_load_actual_state_returns_false_on_empty_file(tmp_path, monkeypatch):
    """_load_actual_state must return False (not raise) when a file is empty."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    (tmp_path / "world" / "services.json").write_text("")
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    result = sup._load_actual_state()
    assert result is False


def test_load_actual_state_returns_true_on_valid_files(tmp_path, monkeypatch):
    """_load_actual_state returns True and populates actual_state on valid files."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text('{"vps": {"status": "online"}}')
    (tmp_path / "world" / "incidents.json").write_text("{}")
    result = sup._load_actual_state()
    assert result is True
    assert "vps/outline" in sup.actual_state["services"]


def test_parse_failure_preserves_last_known_good_state(tmp_path, monkeypatch):
    """When a file becomes unreadable, actual_state retains the previous good values."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    # First successful load
    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    assert sup._load_actual_state() is True
    assert "vps/outline" in sup.actual_state["services"]
    # File becomes empty (race condition)
    (tmp_path / "world" / "services.json").write_text("")
    assert sup._load_actual_state() is False
    # State must be unchanged from the previous good load
    assert "vps/outline" in sup.actual_state["services"], \
        "Last-known-good state must be preserved on parse failure"


def test_healthy_service_does_not_generate_action(tmp_path, monkeypatch):
    """A desired service that appears healthy in world state generates no action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _populate_desired(sup)
    services = {"vps/outline": {"node": "vps", "service": "outline", "status": "healthy"}}
    (tmp_path / "world" / "services.json").write_text(json.dumps(services))
    (tmp_path / "world" / "nodes.json").write_text("{}")
    (tmp_path / "world" / "incidents.json").write_text("{}")
    sup.reconcile()
    pending = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending == [], "Healthy service must not generate any action"
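
Review note: these tests pin down two behaviours, a write that never leaves a half-written file or a stray `.tmp`, and a loader that reports failure instead of raising on empty or truncated JSON. The supervisor's real `_atomic_write_json` and `_load_actual_state` are not part of this hunk, so the sketch below only illustrates the general write-then-`os.replace` pattern the test messages refer to. `atomic_write_json` and `load_json_or_none` are hypothetical names, and the `.tmp` sibling path is assumed from `test_atomic_write_json_no_tmp_left_behind`.

```python
import json
import os
from pathlib import Path
from typing import Any, Optional


def atomic_write_json(path: Path, data: Any) -> None:
    """Write JSON to a sibling .tmp file, then swap it into place."""
    tmp = path.with_suffix(".tmp")  # assumption: same directory, same filesystem
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before the rename
    # os.replace renames over the target, so readers see old or new content,
    # never a partially written file.
    os.replace(tmp, path)


def load_json_or_none(path: Path) -> Optional[Any]:
    """Return parsed JSON, or None when the file is missing, empty, or truncated.

    Note: a file containing the JSON literal null also yields None.
    """
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return None
```

A caller in the spirit of `test_parse_failure_preserves_last_known_good_state` would load every file first and only overwrite its in-memory state when none of the results is `None`.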

@ -0,0 +1,395 @@
"""Tests for HA diagnostic event routing in the supervisor."""
from __future__ import annotations
import json
import sys
import time
from pathlib import Path
import pytest
# Add src/ to path so we can import supervisor without installing
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))
import supervisor as supervisor_module
from supervisor import Supervisor
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _make_event(event_type: str, node: str = "chelsty-ha", service: str = "homeassistant",
                payload: dict | None = None, message: str = "") -> dict:
    return {
        "id": f"evt-{node}-{int(time.time())}-{event_type}-{service}-1",
        "type": event_type,
        "node": node,
        "service": service,
        "severity": "warning",
        "timestamp": int(time.time()),
        "message": message or f"Test event: {event_type}",
        "payload": payload or {"location_tag": "chelsty"},
    }


def _write_event(events_dir: Path, event: dict) -> Path:
    path = events_dir / f"{event['id']}.json"
    path.write_text(json.dumps(event))
    return path


def _setup_supervisor(tmp_path: Path, monkeypatch) -> Supervisor:
    """Return a Supervisor instance with all paths redirected to tmp_path."""
    actions = tmp_path / "actions"
    events = tmp_path / "events"
    world = tmp_path / "world"
    repo = tmp_path / "repo"
    state = tmp_path / "state"
    for d in (actions, events, world, repo / "hosts", state):
        d.mkdir(parents=True, exist_ok=True)
    monkeypatch.setattr(supervisor_module, "ACTIONS_DIR", actions)
    monkeypatch.setattr(supervisor_module, "EVENTS_DIR", events)
    monkeypatch.setattr(supervisor_module, "WORLD_DIR", world)
    monkeypatch.setattr(supervisor_module, "REPO_ROOT", repo)
    sup = Supervisor()
    # Empty desired/actual state so reconcile drift loop is a no-op
    sup.desired_state = {"services": {}}
    sup.actual_state = {"services": {}, "nodes": {}, "incidents": {}}
    return sup


def _pending(tmp_path: Path, action_id: str) -> Path:
    return tmp_path / "actions" / "pending" / f"{action_id}.json"


def _read_action(tmp_path: Path, state: str, action_id: str) -> dict:
    return json.loads((tmp_path / "actions" / state / f"{action_id}.json").read_text())


# ---------------------------------------------------------------------------
# 1. Each event type → correct action type
# ---------------------------------------------------------------------------

def test_ha_websocket_dead_generates_container_restart(tmp_path, monkeypatch):
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", False)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    _write_event(events_dir, _make_event("ha_websocket_dead"))
    sup._process_ha_events()
    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists()
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "container_restart"
    assert action["service"] == "homeassistant"
    assert action["node"] == "chelsty-ha"


@pytest.mark.parametrize("event_type,expected_suffix", [
    ("ha_integration_failed", "integration-failed"),
    ("ha_entity_unavailable_long", "entity-unavailable"),
    ("ha_automation_failing", "automation-failing"),
    ("ha_update_available", "update-available"),
    ("ha_recorder_lag", "recorder-lag"),
    ("ha_system_health_degraded", "system-health-degraded"),
])
def test_alert_only_events_generate_alert_actions(
    tmp_path, monkeypatch, event_type, expected_suffix
):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event(event_type))
    sup._process_ha_events()
    action_id = f"alert-ha-{expected_suffix}-chelsty-ha"
    assert _pending(tmp_path, action_id).exists(), f"No pending action for {event_type}"
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "alert_only"
    assert action["node"] == "chelsty-ha"


# ---------------------------------------------------------------------------
# 2. Transition suppression
# ---------------------------------------------------------------------------

def test_ha_websocket_dead_suppressed_during_transition(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    # Set up world state: homeassistant has an active containers_not_running incident
    inc_id = "inc-123-chelsty-ha-homeassistant"
    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "unhealthy", "incident_id": inc_id,
    }
    sup.actual_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "trigger_type": "containers_not_running",
        "last_occurrence": time.time() - 60,  # 1 min ago, within 5-min window
    }
    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
    sup._process_ha_events()
    action_id = "container-restart-chelsty-ha-homeassistant"
    assert not _pending(tmp_path, action_id).exists(), "Action should be suppressed during transition"


def test_ha_alert_suppressed_during_transition(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    inc_id = "inc-456-chelsty-ha-homeassistant"
    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "unhealthy", "incident_id": inc_id,
    }
    sup.actual_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "trigger_type": "containers_not_running",
        "last_occurrence": time.time() - 30,
    }
    for event_type in supervisor_module.HA_ALERT_ONLY_EVENTS:
        _write_event(tmp_path / "events", _make_event(event_type))
    sup._process_ha_events()
    for suffix in supervisor_module._HA_ALERT_ID_SUFFIX.values():
        action_id = f"alert-ha-{suffix}-chelsty-ha"
        assert not _pending(tmp_path, action_id).exists(), \
            f"{action_id} should be suppressed"


def test_transition_suppression_expires_after_window(tmp_path, monkeypatch):
    """After 5 min, transition window expires and events are routed normally."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    inc_id = "inc-789-chelsty-ha-homeassistant"
    sup.actual_state["services"]["chelsty-ha/homeassistant"] = {
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "unhealthy", "incident_id": inc_id,
    }
    sup.actual_state["incidents"][inc_id] = {
        "id": inc_id, "status": "active",
        "trigger_type": "containers_not_running",
        "last_occurrence": time.time() - 400,  # 6.7 min ago, outside window
    }
    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
    sup._process_ha_events()
    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists(), "Should not be suppressed after window"


# ---------------------------------------------------------------------------
# 3. Recovery cancellation
# ---------------------------------------------------------------------------

def test_ha_websocket_recovered_cancels_pending_restart(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    actions = tmp_path / "actions"
    (actions / "cancelled").mkdir(parents=True, exist_ok=True)
    # Pre-create a pending container_restart for homeassistant
    action_id = "container-restart-chelsty-ha-homeassistant"
    pending_action = {
        "action_id": action_id, "type": "container_restart",
        "node": "chelsty-ha", "service": "homeassistant",
        "status": "pending", "timestamp": time.time(),
    }
    _pending(tmp_path, action_id).write_text(json.dumps(pending_action))
    _write_event(events_dir, _make_event("ha_websocket_recovered"))
    sup._process_ha_events()
    assert not _pending(tmp_path, action_id).exists(), "Pending action should be cancelled"
    cancelled = actions / "cancelled" / f"{action_id}.json"
    assert cancelled.exists()
    data = json.loads(cancelled.read_text())
    assert data["cancelled_reason"] == "ha_websocket_recovered"


def test_ha_websocket_recovered_no_pending_action_is_noop(tmp_path, monkeypatch):
    """Recovery event when no pending restart exists must not raise."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_websocket_recovered"))
    sup._process_ha_events()  # should not raise


# ---------------------------------------------------------------------------
# 4. Cooldown
# ---------------------------------------------------------------------------

def test_ha_websocket_dead_cooldown_prevents_second_restart(tmp_path, monkeypatch):
    """Two ha_websocket_dead events within 30 min → only one container_restart."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    actions = tmp_path / "actions"
    (actions / "completed").mkdir(parents=True, exist_ok=True)
    # First event → action generated
    _write_event(events_dir, _make_event("ha_websocket_dead", service="homeassistant"))
    sup._process_ha_events()
    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists()
    # Simulate: action completed recently (< 30 min ago)
    action_data = json.loads(_pending(tmp_path, action_id).read_text())
    action_data["status"] = "completed"
    action_data["finished_at"] = time.time() - 60  # 1 min ago
    (actions / "completed" / f"{action_id}.json").write_text(json.dumps(action_data))
    _pending(tmp_path, action_id).unlink()
    # Second event: should be suppressed by cooldown
    event2 = _make_event("ha_websocket_dead", service="homeassistant")
    event2["id"] = event2["id"] + "-2"  # different event ID
    _write_event(events_dir, event2)
    sup._process_ha_events()
    assert not _pending(tmp_path, action_id).exists(), "Second restart within cooldown should be suppressed"


def test_ha_websocket_dead_cooldown_expires(tmp_path, monkeypatch):
    """After cooldown expires, a new ha_websocket_dead should generate an action."""
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    actions = tmp_path / "actions"
    (actions / "completed").mkdir(parents=True, exist_ok=True)
    action_id = "container-restart-chelsty-ha-homeassistant"
    # Pre-populate completed action with timestamp > 30 min ago
    old_action = {
        "action_id": action_id, "type": "container_restart",
        "status": "completed", "finished_at": time.time() - 3700,  # > 30 min
    }
    (actions / "completed" / f"{action_id}.json").write_text(json.dumps(old_action))
    _write_event(events_dir, _make_event("ha_websocket_dead"))
    sup._process_ha_events()
    assert _pending(tmp_path, action_id).exists(), "Should generate new restart after cooldown"


# ---------------------------------------------------------------------------
# 5. Location tag preserved
# ---------------------------------------------------------------------------

def test_location_tag_preserved_in_container_restart_payload(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events",
                 _make_event("ha_websocket_dead", payload={"location_tag": "chelsty", "extra": "data"}))
    sup._process_ha_events()
    action = _read_action(tmp_path, "pending", "container-restart-chelsty-ha-homeassistant")
    assert action["payload"]["location_tag"] == "chelsty"


def test_location_tag_preserved_in_alert_only_payload(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events",
                 _make_event("ha_entity_unavailable_long",
                             payload={"location_tag": "ken", "count": 3}))
    sup._process_ha_events()
    action = _read_action(tmp_path, "pending", "alert-ha-entity-unavailable-chelsty-ha")
    assert action["payload"]["location_tag"] == "ken"


# ---------------------------------------------------------------------------
# 6. Dedup: same alert type twice → only one pending action
# ---------------------------------------------------------------------------

def test_alert_only_dedup_second_event_skipped(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    event1 = _make_event("ha_entity_unavailable_long")
    event2 = _make_event("ha_entity_unavailable_long")
    event2["id"] = event2["id"] + "-2"
    _write_event(events_dir, event1)
    _write_event(events_dir, event2)
    sup._process_ha_events()
    action_id = "alert-ha-entity-unavailable-chelsty-ha"
    assert _pending(tmp_path, action_id).exists()
    # Only one file, not duplicated
    pending_files = list((tmp_path / "actions" / "pending").glob("alert-ha-entity-unavailable*.json"))
    assert len(pending_files) == 1


# ---------------------------------------------------------------------------
# 7. Shadow mode
# ---------------------------------------------------------------------------

def test_shadow_mode_websocket_dead_generates_alert_not_restart(tmp_path, monkeypatch):
    """shadow_mode=True: ha_websocket_dead → alert_only with [SHADOW MODE], not container_restart."""
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", True)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
    sup._process_ha_events()
    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists(), "Shadow alert should be written"
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "alert_only"
    assert "[SHADOW MODE]" in action["description"]
    assert action["payload"].get("shadow_mode") is True


def test_no_shadow_mode_websocket_dead_generates_container_restart(tmp_path, monkeypatch):
    """shadow_mode=False: ha_websocket_dead → container_restart (normal path)."""
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", False)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_websocket_dead"))
    sup._process_ha_events()
    action_id = "container-restart-chelsty-ha-homeassistant"
    assert _pending(tmp_path, action_id).exists()
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "container_restart"
    assert "[SHADOW MODE]" not in action["description"]


def test_shadow_mode_alert_only_events_unaffected(tmp_path, monkeypatch):
    """shadow_mode=True: alert-only events (ha_entity_unavailable_long) are still routed normally."""
    monkeypatch.setattr(supervisor_module, "HA_DIAG_SHADOW_MODE", True)
    sup = _setup_supervisor(tmp_path, monkeypatch)
    _write_event(tmp_path / "events", _make_event("ha_entity_unavailable_long"))
    sup._process_ha_events()
    action_id = "alert-ha-entity-unavailable-chelsty-ha"
    assert _pending(tmp_path, action_id).exists()
    action = _read_action(tmp_path, "pending", action_id)
    assert action["type"] == "alert_only"
    assert "[SHADOW MODE]" not in action["description"]


# ---------------------------------------------------------------------------
# 8. Non-HA events are ignored
# ---------------------------------------------------------------------------

def test_non_ha_events_not_routed(tmp_path, monkeypatch):
    sup = _setup_supervisor(tmp_path, monkeypatch)
    events_dir = tmp_path / "events"
    for etype in ("service_unhealthy", "containers_not_running", "node_online", "deployment_failed"):
        e = _make_event(etype, service="mosquitto")
        e["type"] = etype
        _write_event(events_dir, e)
    sup._process_ha_events()
    pending_files = list((tmp_path / "actions" / "pending").glob("*.json"))
    assert pending_files == [], "Non-HA events should not generate actions via HA path"
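
Review note: taken together, the HA tests describe a small routing table (event type to action ID and action type), a shadow-mode switch, a 5-minute transition window, and a 30-minute restart cooldown. The sketch below restates those expectations as standalone functions so the rules can be read in one place. It is not the supervisor's code: `route_ha_event`, `within_window`, and the three constants are hypothetical names, and the strict `<` comparison at the window boundary is an assumption the tests do not pin down.

```python
from typing import Optional, Tuple

# Event type -> action-ID suffix, copied from the parametrized test above.
ALERT_ID_SUFFIX = {
    "ha_integration_failed": "integration-failed",
    "ha_entity_unavailable_long": "entity-unavailable",
    "ha_automation_failing": "automation-failing",
    "ha_update_available": "update-available",
    "ha_recorder_lag": "recorder-lag",
    "ha_system_health_degraded": "system-health-degraded",
}
TRANSITION_WINDOW_S = 5 * 60   # "within 5-min window" in the tests
RESTART_COOLDOWN_S = 30 * 60   # "within 30 min" in the tests


def route_ha_event(event_type: str, node: str, service: str,
                   shadow_mode: bool) -> Optional[Tuple[str, str]]:
    """Return (action_id, action_type), or None for events the HA path ignores."""
    if event_type == "ha_websocket_dead":
        # Shadow mode keeps the restart's action ID but downgrades it to an alert.
        action_type = "alert_only" if shadow_mode else "container_restart"
        return f"container-restart-{node}-{service}", action_type
    suffix = ALERT_ID_SUFFIX.get(event_type)
    if suffix is not None:
        return f"alert-ha-{suffix}-{node}", "alert_only"
    return None


def within_window(then: float, now: float, window_s: float) -> bool:
    """True while fewer than window_s seconds have elapsed since then."""
    return (now - then) < window_s
```

Because the action ID is derived only from the event type, node, and service, two events of the same kind map to the same pending file, which is consistent with the dedup expectation in section 6.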
