homelab-codex-ws/docs/stability-agent-rollout.md

Stability Agent Multi-Node Rollout

Architecture Summary

The stability-agent is a lightweight Python service that monitors node health (disk, Docker containers, Tailscale, MQTT) and publishes state to a central Redis instance running on PIHA.

  • Source: services/stability-agent
  • State Path: /opt/homelab/state
  • Events Path: /opt/homelab/events
  • Redis Target: 100.108.208.3:6379 (PIHA)

Why the UI only showed CHELSTY

Previously, NODE_NAME defaulted to chelsty and the stability-agent was deployed only on that node. The Agent System UI materializer on PIHA builds its node list from the Redis keys homelab:nodes:<NODE_NAME>. Because no other agent was publishing under its own NODE_NAME, the UI showed only that single active node.
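
The effect of a shared default can be sketched in a few lines of Python. This is an illustration only, not the agent's actual code: the helper name node_key is hypothetical, and only the key pattern homelab:nodes:<NODE_NAME> and the old chelsty default are taken from the description above.

```python
# Illustrative sketch; node_key is a hypothetical helper, not the agent's real code.
DEFAULT_NODE_NAME = "chelsty"  # the old default described above


def node_key(env: dict) -> str:
    """Build the Redis key a node would publish its state under."""
    name = env.get("NODE_NAME", DEFAULT_NODE_NAME)
    return f"homelab:nodes:{name}"


# Hosts that never set NODE_NAME all collapse onto the same key,
# so a UI that lists nodes by key would see a single node.
unset = {node_key({}) for _ in range(3)}

# Hosts that set NODE_NAME explicitly each get their own key.
explicit = {node_key({"NODE_NAME": n}) for n in ("piha", "chelsty", "solaria", "vps")}

print(sorted(unset))
print(len(explicit))
```

This is why every deployment command in this document sets NODE_NAME explicitly instead of relying on a default.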

Deployment Commands

Use the helper script to generate the commands for a given node:

./scripts/deploy/deploy-stability-agent.sh <node-name>

PIHA

cd ~/homelab-codex-ws
git pull
cd services/stability-agent
NODE_NAME=piha REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate

CHELSTY

cd ~/homelab-codex-ws
git pull
cd services/stability-agent
NODE_NAME=chelsty REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate

SOLARIA

cd ~/homelab-codex-ws
git pull
cd services/stability-agent
NODE_NAME=solaria REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate

VPS

cd ~/homelab-codex-ws
git pull
cd services/stability-agent
NODE_NAME=vps REDIS_HOST=100.108.208.3 REDIS_PORT=6379 REDIS_ENABLED=true docker compose up -d --build --force-recreate

SATURN (Optional)

Saturn is the orchestrator and can optionally run the stability-agent. If deployed, follow the same pattern with NODE_NAME=saturn.
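
The per-node blocks above differ only in NODE_NAME, so the pattern can be sketched as a small Python function. The name build_deploy_commands is hypothetical; it is not the repo's deploy-stability-agent.sh, whose exact output is not shown in this document. The Redis host, port, and compose flags are copied from the commands above.

```python
# Sketch of the shared command pattern; build_deploy_commands is a
# hypothetical helper, not the repo's deploy-stability-agent.sh.
REDIS_HOST = "100.108.208.3"  # PIHA, as listed in the Architecture Summary
REDIS_PORT = 6379


def build_deploy_commands(node_name: str) -> list[str]:
    """Return the shell commands to deploy the agent on one node."""
    node = node_name.strip().lower()
    if not node:
        raise ValueError("node name must not be empty")
    env = (
        f"NODE_NAME={node} REDIS_HOST={REDIS_HOST} "
        f"REDIS_PORT={REDIS_PORT} REDIS_ENABLED=true"
    )
    return [
        "cd ~/homelab-codex-ws",
        "git pull",
        "cd services/stability-agent",
        f"{env} docker compose up -d --build --force-recreate",
    ]


if __name__ == "__main__":
    # Print the commands for the optional Saturn deployment.
    print("\n".join(build_deploy_commands("saturn")))
```

Lowercasing the name is an assumption based on the lowercase NODE_NAME values used above.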

Verification (on PIHA)

Verify Redis keys:

docker exec agent-system-redis redis-cli KEYS 'homelab:nodes:*'
docker exec agent-system-redis redis-cli HGETALL homelab:nodes:<node-name>

Verify Web UI backend:

curl -s http://127.0.0.1:18180/nodes
curl -k https://agents.okit.pl/nodes
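
To check the KEYS output against the nodes you expect, a small Python sketch like the following may help. It assumes redis-cli prints one key per line, which is its usual behaviour when not attached to a terminal; the function name missing_nodes and the expected node list are illustrative choices based on the deployment sections above.

```python
# Sketch: compare the KEYS output with the nodes that should be reporting.
# missing_nodes and EXPECTED_NODES are illustrative, not part of the repo.
EXPECTED_NODES = ["piha", "chelsty", "solaria", "vps"]
KEY_PREFIX = "homelab:nodes:"


def missing_nodes(keys_output: str, expected=EXPECTED_NODES) -> list[str]:
    """Return expected node names that have no homelab:nodes:<name> key."""
    present = set()
    for line in keys_output.splitlines():
        key = line.strip()
        if key.startswith(KEY_PREFIX):
            present.add(key[len(KEY_PREFIX):])
    return [name for name in expected if name not in present]


# Example input: only two agents have published so far.
sample = "homelab:nodes:piha\nhomelab:nodes:chelsty\n"
print(missing_nodes(sample))  # → ['solaria', 'vps']
```

An empty list means every expected node has published at least once; a node that is still missing may simply not have reached its next publish cycle yet.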

Troubleshooting

  • Redis empty after compose down: The agent-system-redis container on PIHA keeps its data only in transient storage unless it is configured with a volume, so its keys are lost when it is recreated. After a restart, the keys reappear once each agent republishes its state, which happens automatically every CHECK_INTERVAL.
  • Secrets: .env files and local secrets are not committed to the repo. Set MQTT_HOST and any other node-specific secrets via local overrides where needed.
  • Telegram: Telegram bot notifications are optional and can stay disabled when TELEGRAM_BOT_TOKEN is absent.
  • Docker Socket: If the agent reports Docker as unavailable, ensure /var/run/docker.sock is mounted and that the user running the agent has permission to access it.
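
For the Docker socket item, the two conditions (socket present, user has access) can be checked with a short Python sketch run in the same environment as the agent. The function name docker_socket_status and its return values are hypothetical and are not how the agent itself reports Docker availability.

```python
# Sketch: diagnose the Docker socket from the agent's point of view.
# docker_socket_status and its return strings are illustrative only.
import os
import stat


def docker_socket_status(path: str = "/var/run/docker.sock") -> str:
    """Classify the socket as ok, missing, not-a-socket, or no-permission."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"        # socket is not mounted at this path
    except PermissionError:
        return "no-permission"  # a parent directory is not accessible
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"   # something else exists at this path
    if not os.access(path, os.R_OK | os.W_OK):
        return "no-permission"  # current user cannot read and write it
    return "ok"


print(docker_socket_status())
```

A result of missing points at the mount, while no-permission points at the user or group the agent runs as.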