Oskar Kapala f5dcefc752	2026-06-03 14:29:12 +02:00

fix(observer): robust incident lifecycle + orphan auto-resolve

Two root causes for stale "active" incidents on the dashboard:

1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at can be an ISO-8601 string (stability-agent via events.py) or a Unix int (node-agent). The previous session's auto-resolve did plain `time.time() - last_occ`, which raises TypeError for strings, silently preventing _save_world() from being called and leaving incidents perpetually "active" on disk.

   Fix: add _parse_ts(ts) -> float that handles int, float, and ISO-8601 strings uniformly. All timestamp arithmetic now goes through it; it returns 0.0 on None / garbage to keep comparisons safe.

2. Orphaned active incidents: _resolve_incident clears service["incident_id"] and marks the incident "resolved" in memory, but if incidents.json was truncated mid-write (pre-atomic-write era), the observer loaded it at next startup with status="active" and no service entry pointing to it. No code ever touched these orphans again.

   Fix: _prune_stale_world now runs two cleanup passes each cycle:
   - Case 1 (healthy-linked): service.status=="healthy" AND incident_id still set → resolve immediately (a healthy service cannot have an active incident)
   - Case 2 (orphaned): active incident with no service link AND last_occurrence > 5 min ago → resolve (5-min guard for creation race)

   Both cases are wrapped in try/except so a bug here never crashes the observer loop or blocks _save_world. Also fixes the 7-day stale-incident prune to use _parse_ts so ISO-string resolved_at values are handled correctly.

3. Operator UI: current_incidents() now filters to status=="active" only. Resolved incidents were previously included in the /incidents endpoint, making the dashboard show a wall of historical records as if active.

Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json (now written atomically) and deletes old event files. No non-atomic writes found. Midnight clustering was likely external (logrotate / OS flush); the supervisor's resilient loader already handles such transient issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
backups/zigbee	Add Zigbee coordinator backup	2026-05-14 18:24:26 +02:00
docs	docs: add planner-agent docs and session summary 2026-05-27	2026-05-27 22:35:59 +02:00
dotfiles	add shared zshrc	2026-05-10 20:52:44 +02:00
hosts	feat(piha): brain-watchdog — external watchdog for control-plane	2026-06-01 17:54:36 +02:00
inventory	feat(piha): brain-watchdog — external watchdog for control-plane	2026-06-01 17:54:36 +02:00
scripts	fix(observer): robust incident lifecycle + orphan auto-resolve	2026-06-03 14:29:12 +02:00
services	fix(observer): robust incident lifecycle + orphan auto-resolve	2026-06-03 14:29:12 +02:00
.codex	Document current homelab state	2026-04-15 17:37:25 +02:00
.gitignore	chore: gitignore *.egg-info, remove committed egg-info	2026-05-29 12:26:57 +02:00
CLAUDE.md	docs(claude): add Definition of Done for services (smoke test + pytest)	2026-06-01 20:38:39 +02:00
codex_context	Add session context state	2026-04-20 22:10:39 +02:00
codex_context.yaml	add shared context lock	2026-05-05 17:25:50 +02:00
deploy_agent.py	Add deploy escalation output	2026-04-22 22:08:26 +02:00
ollama_client.py	Initial shared homelab agent workspace	2026-05-03 19:37:40 +02:00
README.md	docs: add planner-agent docs and session summary 2026-05-27	2026-05-27 22:35:59 +02:00
start-aider.sh	Initial shared homelab agent workspace	2026-05-03 19:37:40 +02:00
start-codex.sh	Initial shared homelab agent workspace	2026-05-03 19:37:40 +02:00
sync-context.sh	add shared context lock	2026-05-05 17:25:50 +02:00
tech-debt.md	docs: add tech-debt.md, forgejo_runner temp disabled	2026-05-21 10:37:42 +02:00
update-context.md	Initial shared homelab agent workspace	2026-05-03 19:37:40 +02:00

README.md

Homelab Codex

GitOps-lite orchestration for a distributed homelab environment.

Architecture

The homelab consists of several nodes connected via a Tailscale internal mesh.

Host	Role	Description
SATURN	Primary Node	Development, orchestration, and git source of truth (commit node).
SOLARIA	Compute Node	GPU, inference, and heavy compute workloads.
PIHA	Infra Node	Core infrastructure services, automation, and monitoring.
VPS	Edge Node	Public ingress, reverse proxy, and edge services.

Agent System

The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:

Agent	Node	Role
stability-agent	all nodes	Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events
node-agent	all nodes	Publishes container health events to Redis pub/sub
observer	VPS	Synthesizes world state from events into `/opt/homelab/world/*.json`
supervisor	VPS	Detects drift between desired and actual state; writes `pending` actions
planner-agent	SOLARIA	LLM-powered diagnosis — listens to Redis, proposes remediation actions
executor	VPS	Executes actions only after operator approval
operator-ui + telegram-bot	VPS / PIHA	Operator reviews and approves/rejects pending actions

Action approval flow: pending/ → operator approves → approved/ → executor runs.
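
As a rough illustration of this flow, the sketch below models approval as a file move between two directories. The directory names `pending/` and `approved/` come from the line above. Everything else is an assumption: the function names `approve_action` and `runnable_actions`, the `actions_root` parameter, and the idea that each action is a single `.json` file.

```python
import os
from pathlib import Path


def approve_action(actions_root: Path, action_name: str) -> Path:
    """Move one action file from pending/ to approved/ (hypothetical helper).

    os.replace is atomic when source and destination are on the same
    filesystem, so a reader never sees a half-moved action.
    """
    src = actions_root / "pending" / action_name
    if not src.is_file():
        raise FileNotFoundError(f"no pending action named {action_name!r}")
    approved_dir = actions_root / "approved"
    approved_dir.mkdir(parents=True, exist_ok=True)
    dst = approved_dir / action_name
    os.replace(src, dst)
    return dst


def runnable_actions(actions_root: Path) -> list:
    """List the actions an executor would be allowed to run.

    Only files under approved/ are returned; pending/ is never consulted.
    """
    approved_dir = actions_root / "approved"
    if not approved_dir.is_dir():
        return []
    return sorted(approved_dir.glob("*.json"))
```

Reading only from `approved/` keeps the human-in-the-loop guarantee structural: an action that has not been moved by the operator is not visible to the executor at all.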

Repository Structure

docs/: Infrastructure Standards and Deployment Conventions.
hosts/: Host-specific configurations and service assignments.
services/: Reusable Docker Compose service definitions.
scripts/: Deployment and management scripts.

Getting Started

Standardization: Follow the Infrastructure Standards.
Deployment: See Deployment Conventions for how to roll out changes.
SATURN: Remember that SATURN is the only node where commits should be made.

Documentation Index

Infrastructure Standards
Agent Operating Procedures (For AI/Non-Human Agents)
Deployment Conventions
Hardware
Networking
Services
Node Capabilities
Action Model

Note: This repository documents the state of the homelab. Runtime state lives outside the repository in /opt/homelab.