scripts/monitor/health-monitor.sh (new):
- Standalone bash health monitor: disk/RAM/CPU checks + docker container health
- Per-node-type cleanup policy enforced:
lte_node (chelsty-infra, chelsty-ha): NO cleanup, no docker ops
sd_card (piha, saturn): dangling images + containers, rate-limited once/24h
ai_node (solaria): dangling + containers + build cache, NEVER -a
standard (vps): dangling + containers + build cache + CP filesystem rotation
- VPS filesystem rotation: completed/failed actions >7d, deploy logs >30d,
events >3d AND past observer checkpoint
- Emits structured JSON events (node_health, disk_pressure, high_memory, high_cpu,
containers_not_running, healthcheck_failed)
services/node-agent/ (new):
- Python daemon (node_agent.py): same policy as bash script, Docker SDK
for container checks and cleanup, /proc for system metrics
- Optional event shipping to VPS via rsync+SSH (VPS_EVENTS_HOST env var)
- Dockerfile: python:3.11-slim + openssh-client + rsync + docker>=6.0
- docker-compose.yml: mounts docker socket, /opt/homelab, repo read-only
observer.py:
- Handle node_health: update node status + disk/mem/cpu metrics, clear disk_pressure
- Handle disk_pressure: record severity on node, clear when healthy
- Handle high_memory / high_cpu: record pressure level for correlation
supervisor.py:
- Add NO_DISK_CLEANUP_NODES = {chelsty-infra, chelsty-ha}
- reconcile() step 3: generate disk_cleanup actions for nodes with high disk pressure
- _generate_disk_cleanup_recommendation(): stable ID disk-cleanup-{node},
checks all active states, risk=guarded (operator approval required)
executor.py:
- Handle disk_cleanup action type via _execute_disk_cleanup()
- Commands come from action payload; safety gate rejects any command touching
/opt/homelab/data/, /opt/homelab/config/, /opt/homelab/state/, or rm -rf /
hosts/*/services.yaml:
- Rename stability-agent -> node-agent on piha, vps, solaria, chelsty-infra
- Add node-agent to chelsty-ha (previously missing)
- Add cleanup policy notes to LTE node comments
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
||
|---|---|---|
| backups/zigbee | ||
| docs | ||
| dotfiles | ||
| hosts | ||
| inventory | ||
| scripts | ||
| services | ||
| .codex | ||
| .gitignore | ||
| CLAUDE.md | ||
| codex_context | ||
| codex_context.yaml | ||
| deploy_agent.py | ||
| ollama_client.py | ||
| README.md | ||
| start-aider.sh | ||
| start-codex.sh | ||
| sync-context.sh | ||
| tech-debt.md | ||
| update-context.md | ||
Homelab Codex
GitOps-lite orchestration for a distributed homelab environment.
Architecture
The homelab consists of several nodes connected via a Tailscale internal mesh.
| Host | Role | Description |
|---|---|---|
| SATURN | Primary Node | Development, orchestration, and git source of truth (commit node). |
| SOLARIA | Compute Node | GPU, inference, and heavy compute workloads. |
| PIHA | Infra Node | Core infrastructure services, automation, and monitoring. |
| VPS | Edge Node | Public ingress, reverse proxy, and edge services. |
Repository Structure
docs/: Infrastructure Standards and Deployment Conventions.hosts/: Host-specific configurations and service assignments.services/: Reusable Docker Compose service definitions.scripts/: Deployment and management scripts.
Getting Started
- Standardization: Follow the Infrastructure Standards.
- Deployment: See Deployment Conventions for how to roll out changes.
- SATURN: Remember that SATURN is the only node where commits should be made.
Documentation Index
- Infrastructure Standards
- Agent Operating Procedures (For AI/Non-Human Agents)
- Deployment Conventions
- Hardware
- Networking
- Services
- Node Capabilities
- Action Model
Note: This repository documents the state of the homelab. Runtime state lives outside the repository in /opt/homelab.