New checks:
- SystemHealthCheck (15min interval): detects newly-failing HA integrations via /api/system_health snapshot diff; transition-based dedup (ok→error fires, sustained error silent, error→ok clears alert)
- UpdatesAvailableCheck (daily cron 09:00): per-update ha_update_available events with 7-day dedup; release notes truncated at 2000 chars
- UpdatesDigestCheck (Sunday cron 09:00): single digest event with all pending updates; weekly ISO-week dedup, independent of daily dedup key
- AutomationFailuresCheck (30min interval): detects automations with N consecutive failures (default 3) via /api/trace/automation/<id>; 6h cooldown per automation

Phase 3 flag fixes:
- Flag #1 (since field): UnavailableEntitiesCheck now uses min(state.last_changed, baseline.first_seen) as effective "since", giving accurate duration when agent was offline at entity's first fail
- Flag #3 (registry cache): HAClient.get_entity_registry() caches response in-process with configurable TTL (default 300s); avoids repeated API calls across concurrent check cycles; invalidate_registry_cache() for manual invalidation

Storage: system_health_snapshot table (component, last_status, last_seen_at, payload) created automatically on next Storage.open() call

Config additions (all with defaults): entity_registry_cache_ttl=300, system_health_check_interval=900, automation_check_interval=1800, automation_failure_threshold=3, updates_check_hour=9, updates_check_minute=0, updates_cooldown_days=7

Tests: 95 unit tests pass (49 new), 13 integration tests pass (9 new); 3 skipped (live-HA token not set in CI)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
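The transition-based dedup described for SystemHealthCheck in the commit notes above (ok→error fires, sustained error stays silent, error→ok clears) can be sketched roughly as follows. This is only an illustration, not the repository's implementation: the function name `diff_health`, the plain `dict` snapshots, and the choice to treat a previously unseen component as `"ok"` are all assumptions.

```python
from typing import Dict, List, Tuple


def diff_health(previous: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two status snapshots and return (component, action) pairs.

    Hypothetical helper: only status *transitions* produce an action, so a
    component that stays in "error" across snapshots does not re-alert.
    """
    actions: List[Tuple[str, str]] = []
    for component, status in current.items():
        # Assumption: a component missing from the previous snapshot counts as "ok".
        before = previous.get(component, "ok")
        if before == "ok" and status == "error":
            actions.append((component, "fire"))   # ok -> error: raise an alert
        elif before == "error" and status == "ok":
            actions.append((component, "clear"))  # error -> ok: clear the alert
        # ok -> ok and error -> error fall through silently
    return actions
```

Persisting the previous snapshot (for example in the `system_health_snapshot` table mentioned above) is what would let such a diff survive agent restarts.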
Homelab Codex
GitOps-lite orchestration for a distributed homelab environment.
Architecture
The homelab consists of several nodes connected via a Tailscale internal mesh.
| Host | Role | Description |
|---|---|---|
| SATURN | Primary Node | Development, orchestration, and git source of truth (commit node). |
| SOLARIA | Compute Node | GPU, inference, and heavy compute workloads. |
| PIHA | Infra Node | Core infrastructure services, automation, and monitoring. |
| VPS | Edge Node | Public ingress, reverse proxy, and edge services. |
Agent System
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
| Agent | Node | Role |
|---|---|---|
| stability-agent | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
| node-agent | all nodes | Publishes container health events to Redis pub/sub |
| observer | VPS | Synthesizes world state from events into /opt/homelab/world/*.json |
| supervisor | VPS | Detects drift between desired and actual state; writes pending actions |
| planner-agent | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
| executor | VPS | Executes actions only after operator approval |
| operator-ui + telegram-bot | VPS / PIHA | Operator reviews and approves/rejects pending actions |
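As a rough illustration of the supervisor's role in the table above, drift detection can be thought of as a comparison between a desired-state mapping and the observed one. The sketch below is hypothetical: the function name `detect_drift`, the service-to-status dictionaries, and the shape of the pending-action records are assumptions, not the actual world-state schema.

```python
from typing import Dict, List


def detect_drift(desired: Dict[str, str], actual: Dict[str, str]) -> List[dict]:
    """Return one pending-action record per service whose state has drifted.

    Hypothetical sketch: both inputs map a service name to a status string.
    """
    pending: List[dict] = []
    for service, want in sorted(desired.items()):
        # Assumption: a service absent from the observed state is "missing".
        have = actual.get(service, "missing")
        if have != want:
            pending.append({"service": service, "desired": want, "actual": have})
    return pending
```

A supervisor built this way would only propose actions; nothing runs until the operator approves.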
Action approval flow: pending/ → operator approves → approved/ → executor runs.
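A minimal sketch of this flow, assuming each action is a JSON file named `<id>.json` inside `pending/` and `approved/` directories under a common root. The file naming, the helper names `approve` and `runnable_actions`, and the record fields are illustrative assumptions rather than the repository's actual layout.

```python
import json
import shutil
from pathlib import Path
from typing import List


def approve(root: Path, action_id: str) -> Path:
    """Move an action file from pending/ to approved/ (the operator's step)."""
    src = root / "pending" / f"{action_id}.json"
    dst_dir = root / "approved"
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / src.name
    shutil.move(str(src), str(dst))  # raises if the pending file does not exist
    return dst


def runnable_actions(root: Path) -> List[dict]:
    """Return the actions an executor may run: only those under approved/."""
    approved = root / "approved"
    if not approved.is_dir():
        return []
    return [json.loads(p.read_text()) for p in sorted(approved.glob("*.json"))]
```

Because the executor in this sketch never reads `pending/`, an action cannot run unless a human has moved it.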
Repository Structure
- docs/: Infrastructure Standards and Deployment Conventions.
- hosts/: Host-specific configurations and service assignments.
- services/: Reusable Docker Compose service definitions.
- scripts/: Deployment and management scripts.
Getting Started
- Standardization: Follow the Infrastructure Standards.
- Deployment: See Deployment Conventions for how to roll out changes.
- SATURN: Remember that SATURN is the only node where commits should be made.
Documentation Index
- Infrastructure Standards
- Agent Operating Procedures (For AI/Non-Human Agents)
- Deployment Conventions
- Hardware
- Networking
- Services
- Node Capabilities
- Action Model
Note: This repository documents the state of the homelab. Runtime state lives outside the repository in /opt/homelab.