Two independent fixes for the false-alarm storm caused by race-condition
reads of truncated world state files:
1. Atomic writes: _atomic_write_json (write→fsync→os.replace) replaces
all bare open('w')+json.dump calls in supervisor and executor, so the
action-file pipeline is never visible in a half-written state.
2. Resilient loader: _load_actual_state now returns False when any world
state file fails to parse (empty or truncated mid-write). reconcile()
skips the entire drift check on False instead of treating {} as "all
services missing". actual_state retains its last-known-good values so
a single bad cycle does not wipe accumulated context.
Before: parse error → raw[key]={} → all desired services missing →
wall of redeploy actions → drift_resolved_auto churn on next cycle.
After: parse error → WARNING logged → cycle skipped → no actions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|---|---|---|
| backups/zigbee | ||
| docs | ||
| dotfiles | ||
| hosts | ||
| inventory | ||
| scripts | ||
| services | ||
| .codex | ||
| .gitignore | ||
| CLAUDE.md | ||
| codex_context | ||
| codex_context.yaml | ||
| deploy_agent.py | ||
| ollama_client.py | ||
| README.md | ||
| start-aider.sh | ||
| start-codex.sh | ||
| sync-context.sh | ||
| tech-debt.md | ||
| update-context.md | ||
Homelab Codex
GitOps-lite orchestration for a distributed homelab environment.
Architecture
The homelab consists of several nodes connected via a Tailscale internal mesh.
| Host | Role | Description |
|---|---|---|
| SATURN | Primary Node | Development, orchestration, and git source of truth (commit node). |
| SOLARIA | Compute Node | GPU, inference, and heavy compute workloads. |
| PIHA | Infra Node | Core infrastructure services, automation, and monitoring. |
| VPS | Edge Node | Public ingress, reverse proxy, and edge services. |
Agent System
The homelab uses a multi-agent orchestration model with human-in-the-loop for destructive actions:
| Agent | Node | Role |
|---|---|---|
| stability-agent | all nodes | Per-node watchdog — monitors Docker, disk, Tailscale, MQTT; emits events |
| node-agent | all nodes | Publishes container health events to Redis pub/sub |
| observer | VPS | Synthesizes world state from events into /opt/homelab/world/*.json |
| supervisor | VPS | Detects drift between desired and actual state; writes pending actions |
| planner-agent | SOLARIA | LLM-powered diagnosis — listens to Redis, proposes remediation actions |
| executor | VPS | Executes actions only after operator approval |
| operator-ui + telegram-bot | VPS / PIHA | Operator reviews and approves/rejects pending actions |
Action approval flow: pending/ → operator approves → approved/ → executor runs.
Repository Structure
docs/: Infrastructure Standards and Deployment Conventions.hosts/: Host-specific configurations and service assignments.services/: Reusable Docker Compose service definitions.scripts/: Deployment and management scripts.
Getting Started
- Standardization: Follow the Infrastructure Standards.
- Deployment: See Deployment Conventions for how to roll out changes.
- SATURN: Remember that SATURN is the only node where commits should be made.
Documentation Index
- Infrastructure Standards
- Agent Operating Procedures (For AI/Non-Human Agents)
- Deployment Conventions
- Hardware
- Networking
- Services
- Node Capabilities
- Action Model
Note: This repository documents the state of the homelab. Runtime state lives outside the repository in /opt/homelab.