Oskar Kapala f5dcefc752 fix(observer): robust incident lifecycle + orphan auto-resolve
Two root causes for stale "active" incidents on the dashboard, plus a UI filtering fix:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
   can be an ISO-8601 string (stability-agent via events.py) or a Unix
   int (node-agent).  The previous session's auto-resolve did plain
   `time.time() - last_occ` which raises TypeError for strings,
   silently preventing _save_world() from being called and leaving
   incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and
   ISO-8601 strings uniformly. All timestamp arithmetic now goes through
   it; returns 0.0 on None / garbage to keep comparisons safe.
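
A minimal sketch of what such a timestamp normalizer might look like. The name `parse_ts`, the UTC assumption for naive timestamps, and the handling of a trailing "Z" are illustrative assumptions, not the repository's actual `_parse_ts`:

```python
from datetime import datetime, timezone


def parse_ts(ts) -> float:
    """Best-effort conversion of a timestamp to Unix seconds (0.0 if unusable)."""
    # bool is a subclass of int; treat it as garbage rather than 0/1 seconds.
    if ts is None or isinstance(ts, bool):
        return 0.0
    if isinstance(ts, (int, float)):
        return float(ts)
    if isinstance(ts, str):
        text = ts.strip()
        # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        try:
            dt = datetime.fromisoformat(text)
        except ValueError:
            return 0.0
        # Assumption: timestamps without an offset are UTC.
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.timestamp()
    return 0.0
```

With a helper like this, `time.time() - parse_ts(last_occ)` is always float arithmetic, so a string timestamp can no longer raise TypeError.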

2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
   and marks the incident "resolved" in memory, but if incidents.json was
   truncated mid-write (pre-atomic-write era), the observer loaded it at
   next startup with status="active" and no service entry pointing to it.
   No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id
     still set → resolve immediately (service cannot have active incident)
   - Case 2 (orphaned): active incident with no service link AND
     last_occurrence older than 5 min → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the
   observer loop or blocks _save_world.

   Also fixes the 7-day stale-incident prune to use _parse_ts so
   ISO-string resolved_at values are handled correctly.
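
The two cleanup passes could be sketched roughly as follows. The dict shapes (`services` keyed by name, `incidents` keyed by id), the function names, and the compact `parse_ts` stand-in are assumptions made for illustration; the real `_prune_stale_world` may be structured differently:

```python
import logging
import time
from datetime import datetime, timezone
from typing import Optional

log = logging.getLogger("observer_sketch")
ORPHAN_GRACE_SECONDS = 5 * 60  # 5-minute guard against the creation race


def parse_ts(ts) -> float:
    """Compact stand-in for a timestamp normalizer (0.0 on unusable input)."""
    if isinstance(ts, (int, float)) and not isinstance(ts, bool):
        return float(ts)
    if isinstance(ts, str):
        try:
            dt = datetime.fromisoformat(ts.strip().replace("Z", "+00:00"))
        except ValueError:
            return 0.0
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return dt.timestamp()
    return 0.0


def resolve_stale_incidents(services: dict, incidents: dict,
                            now: Optional[float] = None) -> list:
    """Resolve healthy-linked and orphaned incidents in place; return their ids."""
    now = time.time() if now is None else now
    resolved = []

    def mark_resolved(inc_id, inc):
        inc["status"] = "resolved"
        inc["resolved_at"] = now
        resolved.append(inc_id)

    try:  # Case 1: a healthy service must not keep an incident link.
        for svc in services.values():
            inc_id = svc.get("incident_id")
            if svc.get("status") == "healthy" and inc_id:
                inc = incidents.get(inc_id)
                if inc is not None and inc.get("status") == "active":
                    mark_resolved(inc_id, inc)
                svc["incident_id"] = None
    except Exception:
        log.exception("healthy-linked cleanup failed")

    try:  # Case 2: active incidents that no service points to any more.
        linked = {s.get("incident_id") for s in services.values() if s.get("incident_id")}
        for inc_id, inc in incidents.items():
            if inc.get("status") != "active" or inc_id in linked:
                continue
            if now - parse_ts(inc.get("last_occurrence")) > ORPHAN_GRACE_SECONDS:
                mark_resolved(inc_id, inc)
    except Exception:
        log.exception("orphan cleanup failed")

    return resolved
```

Each pass has its own try/except, so a failure in one neither skips the other nor prevents the caller from saving the world state afterwards.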

3. Operator UI: current_incidents() now filters to status=="active" only.
   Resolved incidents were previously included in the /incidents endpoint,
   making the dashboard show a wall of historical records as if active.
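
As an illustration of the filter, assuming incidents are held as a list of dicts with a "status" key (the real `current_incidents()` signature and storage are not shown in this commit message):

```python
def current_incidents(incidents):
    """Return only the incidents that are still active."""
    return [inc for inc in incidents if inc.get("status") == "active"]


sample = [
    {"id": "a", "status": "active"},
    {"id": "b", "status": "resolved"},
    {"id": "c"},  # missing status: treated as not active
]
active_ids = [inc["id"] for inc in current_incidents(sample)]
```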

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
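
For reference, an atomic JSON write is commonly implemented as "write to a temporary file in the same directory, fsync, then rename over the target". The sketch below shows that general pattern; the function name `write_json_atomic` is hypothetical and this is not necessarily how the repository writes observer_checkpoint.json or incidents.json:

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomically replaces any existing target
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves the previous file untouched, which is what prevents the truncated-file scenario described in item 2.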

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-03 14:29:12 +02:00
backups/zigbee Add Zigbee coordinator backup 2026-05-14 18:24:26 +02:00
docs docs: add planner-agent docs and session summary 2026-05-27 2026-05-27 22:35:59 +02:00
dotfiles add shared zshrc 2026-05-10 20:52:44 +02:00
hosts feat(piha): brain-watchdog — external watchdog for control-plane 2026-06-01 17:54:36 +02:00
inventory feat(piha): brain-watchdog — external watchdog for control-plane 2026-06-01 17:54:36 +02:00
scripts fix(observer): robust incident lifecycle + orphan auto-resolve 2026-06-03 14:29:12 +02:00
services fix(observer): robust incident lifecycle + orphan auto-resolve 2026-06-03 14:29:12 +02:00
.codex Document current homelab state 2026-04-15 17:37:25 +02:00
.gitignore chore: gitignore *.egg-info, remove committed egg-info 2026-05-29 12:26:57 +02:00
CLAUDE.md docs(claude): add Definition of Done for services (smoke test + pytest) 2026-06-01 20:38:39 +02:00
codex_context Add session context state 2026-04-20 22:10:39 +02:00
codex_context.yaml add shared context lock 2026-05-05 17:25:50 +02:00
deploy_agent.py Add deploy escalation output 2026-04-22 22:08:26 +02:00
ollama_client.py Initial shared homelab agent workspace 2026-05-03 19:37:40 +02:00
README.md docs: add planner-agent docs and session summary 2026-05-27 2026-05-27 22:35:59 +02:00
start-aider.sh Initial shared homelab agent workspace 2026-05-03 19:37:40 +02:00
start-codex.sh Initial shared homelab agent workspace 2026-05-03 19:37:40 +02:00
sync-context.sh add shared context lock 2026-05-05 17:25:50 +02:00
tech-debt.md docs: add tech-debt.md, forgejo_runner temp disabled 2026-05-21 10:37:42 +02:00
update-context.md Initial shared homelab agent workspace 2026-05-03 19:37:40 +02:00

Homelab Codex

GitOps-lite orchestration for a distributed homelab environment.

Architecture

The homelab consists of several nodes connected via a Tailscale internal mesh.

| Host | Role | Description |
|---|---|---|
| SATURN | Primary Node | Development, orchestration, and git source of truth (commit node). |
| SOLARIA | Compute Node | GPU, inference, and heavy compute workloads. |
| PIHA | Infra Node | Core infrastructure services, automation, and monitoring. |
| VPS | Edge Node | Public ingress, reverse proxy, and edge services. |

Agent System

The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:

| Agent | Node | Role |
|---|---|---|
| stability-agent | all nodes | Per-node watchdog: monitors Docker, disk, Tailscale, MQTT; emits events |
| node-agent | all nodes | Publishes container health events to Redis pub/sub |
| observer | VPS | Synthesizes world state from events into /opt/homelab/world/*.json |
| supervisor | VPS | Detects drift between desired and actual state; writes pending actions |
| planner-agent | SOLARIA | LLM-powered diagnosis: listens to Redis, proposes remediation actions |
| executor | VPS | Executes actions only after operator approval |
| operator-ui + telegram-bot | VPS / PIHA | Operator reviews and approves/rejects pending actions |

Action approval flow: pending/ → operator approves → approved/ → executor runs.
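
The approval flow above can be pictured as moving an action file between directories. The sketch below assumes one JSON file per action under a common root with `pending/` and `approved/` subdirectories; the function names and file layout are hypothetical and only illustrate the idea:

```python
from pathlib import Path


def approve_action(root: Path, action_name: str) -> Path:
    """Move an action file from pending/ to approved/ and return its new path."""
    src = root / "pending" / action_name
    dst_dir = root / "approved"
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / action_name
    # Path.replace is a rename, so the executor never sees a half-moved file.
    # It raises FileNotFoundError if the action is not pending.
    src.replace(dst)
    return dst


def approved_actions(root: Path) -> list:
    """List the action files an executor would be allowed to run."""
    return sorted(p.name for p in (root / "approved").glob("*.json"))
```

Because the executor only ever reads from `approved/`, nothing runs until an operator has explicitly moved the action there.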

Repository Structure

Getting Started

  1. Standardization: Follow the Infrastructure Standards.
  2. Deployment: See Deployment Conventions for how to roll out changes.
  3. SATURN: Remember that SATURN is the only node where commits should be made.

Documentation Index


Note: This repository documents the state of the homelab. Runtime state lives outside the repository in /opt/homelab.