oskar/homelab-codex-ws


Oskar Kapala b524a3886a Harden deployment runtime framework

2026-05-11 21:20:13 +02:00


Deployment Conventions

This document describes the GitOps-lite deployment process for the homelab.

Principles

Git as Source of Truth: All infrastructure definitions (Docker Compose, configurations) are stored in Git.
Unidirectional Flow: Changes flow from SATURN (commit node) to execution nodes.
Lightweight: No complex orchestrators (no Kubernetes). Use docker compose and simple shell scripts.
Tailscale Mesh: All hosts are connected via Tailscale, allowing secure communication without public port exposure.
Host Autonomy: Services that must operate during WAN or Git outages keep their runtime dependencies on the execution node or local LAN.

Staged Deployment Framework

The homelab uses a modular, staged deployment framework whose entrypoint is scripts/deploy/deploy.sh. The script is resumable, stage-aware, and observable; its core logic is split into maintainable libraries in scripts/lib/.

Runtime Architecture

The runtime consists of:

deploy.sh: Orchestration entrypoint.
lib/log.sh: Logging and structured output.
lib/state.sh: Deployment state tracking and stage persistence.
lib/inventory.sh: Reliable host and service discovery (Python-based YAML parsing).
lib/compose.sh: Docker Compose operations.
lib/diagnostics.sh: Post-failure analysis and summary generation.

Deployment Stages

prepare: Pulls the latest changes from Git, validates inventory, and prepares the local environment. It is tolerant of network failures to support intermittently connected nodes like CHELSTY.
validate: Ensures all required service definitions and metadata are present.
deploy: Executes docker compose commands for all assigned services. Supports .env files and docker-compose.override.yml under /opt/homelab/config/<service>/.
verify: Executes service-specific healthcheck.sh scripts or checks container status.
diagnose: Automatically triggered on failure; collects container status and logs for troubleshooting.
complete: Finalizes the deployment and marks the state as finished.
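
The stage order above can be sketched as a small lookup. This is a minimal illustration, not the logic in deploy.sh: the function name next_stage is hypothetical, and diagnose is left out of the success path because the list describes it as failure-only.

```shell
# Hypothetical helper: print the stage that follows a given stage on success.
# Stage names come from the list above; diagnose is omitted because it only
# runs after a failure.
next_stage() {
  case "$1" in
    prepare)  echo validate ;;
    validate) echo deploy ;;
    deploy)   echo verify ;;
    verify)   echo complete ;;
    complete) echo "" ;;   # nothing left to run
    *)        echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

# Example: next_stage deploy   (prints "verify")
```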

State Tracking and Logging

State: Local node state is tracked in /opt/homelab/state/deploy/current_stage. The last successfully processed service in the deploy stage is tracked in last_service to support granular resumption.
Logs: Detailed execution logs are stored in /opt/homelab/logs/deploy/deploy_<timestamp>.log. Structured log entries prefixed with [STRUCT] provide machine-parseable event data.
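
A sketch of how [STRUCT] entries could be written and filtered. The real format produced by lib/log.sh is not shown in this document, so the space-separated key=value layout and the names log_struct and struct_events are assumptions made for illustration.

```shell
# Hypothetical emitter. The key=value payload format is an assumption;
# only the [STRUCT] prefix comes from this document.
# $1 = event name, remaining arguments = key=value pairs.
log_struct() (
  event="$1"; shift
  printf '[STRUCT] ts=%s event=%s' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$event"
  for pair in "$@"; do printf ' %s' "$pair"; done
  printf '\n'
)

# Hypothetical reader: keep only structured lines from a log stream on stdin
# and strip the prefix, leaving the payload for further parsing.
struct_events() {
  sed -n 's/^\[STRUCT] //p'
}

# Example: struct_events < /opt/homelab/logs/deploy/deploy_<timestamp>.log
```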

Resume Semantics

If a deployment is interrupted (e.g., due to LTE disconnect on CHELSTY):

Rerun the script with the --resume flag: scripts/deploy/deploy.sh --resume.
The script finds the first stage without a completion marker (/opt/homelab/state/deploy/stage_<name>_complete) and continues from that point.
In the deploy stage, it specifically resumes from the first service that was not successfully completed, skipping those already up.
Repeated runs with --resume are safe and idempotent: completed stages are not re-executed. Omitting --resume clears the saved state and starts a fresh run.
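
The marker-based resume described above can be sketched as follows. Both function names are hypothetical, the STATE_DIR override exists only so the sketch can run against a scratch directory, and it is an assumption that every stage (including complete) writes a marker.

```shell
# Hypothetical helper: print the first stage that has no completion marker.
# Exits non-zero when every marker exists, i.e. there is nothing to resume.
first_incomplete_stage() (
  state_dir="${STATE_DIR:-/opt/homelab/state/deploy}"
  for stage in prepare validate deploy verify complete; do
    if [ ! -e "$state_dir/stage_${stage}_complete" ]; then
      echo "$stage"
      exit 0
    fi
  done
  exit 1
)

# Hypothetical helper: given the value of last_service and the ordered
# service list, print the services that still need to be deployed.
# An empty first argument means no service has completed yet.
# Note: a name that is not in the list yields an empty result.
remaining_services() (
  last="$1"; shift
  seen=0
  [ -z "$last" ] && seen=1
  for svc in "$@"; do
    if [ "$seen" -eq 1 ]; then
      echo "$svc"
    elif [ "$svc" = "$last" ]; then
      seen=1
    fi
  done
)
```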

Diagnostics and Troubleshooting

The runtime is designed to fail predictably and provide immediate feedback:

Automatic Diagnostics: If any stage fails, collect_diagnostics is triggered to capture system state and container logs into /opt/homelab/logs/deploy/diagnostics_<timestamp>.txt.
Deployment Summary: Every run concludes with a concise summary showing the host status, last stage reached, and log locations.
Offline Resilience: The prepare stage tolerates git pull failures and deploys from the existing local checkout during network instability.
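
A minimal sketch of an offline-tolerant pull, assuming the prepare stage behaves roughly as described above. The function name prepare_pull, the GIT_CMD override, and the two status words are hypothetical; the sketch treats every pull failure the same way, whether it is caused by the network or by something else.

```shell
# Hypothetical sketch: try a fast-forward pull, fall back to the local
# checkout on any failure. GIT_CMD is an assumption added so the git
# binary can be swapped out when exercising the function.
prepare_pull() (
  repo_dir="$1"
  if "${GIT_CMD:-git}" -C "$repo_dir" pull --ff-only >/dev/null 2>&1; then
    echo "updated"
  else
    # Remote unreachable or pull rejected: continue with what is on disk.
    echo "offline-cache"
  fi
)
```

Returning a status word instead of failing lets the caller record in the log which source the deployment used.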

Operational Semantics

Deployment is hybrid:

SATURN acts as the orchestrator and source of truth.
Nodes execute the deployment locally using the deploy.sh script.
A human operator must trigger and confirm each deployment.

Recovery Workflow

If a deployment fails:

Run deploy.sh --stage diagnose to identify the issue.
Use the recover-node AI prompt to analyze logs and get recommendations.
Fix the issue (e.g., update a secret in .env) and run deploy.sh --resume.

Onboarding New Nodes

Refer to inventory/templates/how_to_add_new_node.yaml for a detailed guide on adding new hardware to the mesh. The general flow is:

Define node in hosts/ and inventory/topology.yaml on SATURN.
Bootstrap the node (Docker, Tailscale, Git).
Run the staged deployment framework starting with prepare.

Host-Local Overrides

If a service requires host-specific configuration (e.g., unique device paths for GPUs on SOLARIA):

Create a docker-compose.override.yml in /opt/homelab/config/<service>/.
The deploy stage includes this override when the file exists.
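
One way the optional override and .env file could be passed to Docker Compose is sketched below. The helper name compose_args, the CONFIG_ROOT override, and the base compose file argument are assumptions; only the two paths under /opt/homelab/config/<service>/ come from this document. Because an explicit -f disables Compose's automatic override discovery, the override file has to be listed explicitly.

```shell
# Hypothetical helper: print docker compose arguments for one service,
# one argument per line.
# $1 = service name, $2 = path to the base compose file from Git.
compose_args() (
  service="$1"
  base_file="$2"
  config_dir="${CONFIG_ROOT:-/opt/homelab/config}/$service"

  [ -f "$config_dir/.env" ] && printf '%s\n' --env-file "$config_dir/.env"
  printf '%s\n' -f "$base_file"
  [ -f "$config_dir/docker-compose.override.yml" ] &&
    printf '%s\n' -f "$config_dir/docker-compose.override.yml"
  exit 0
)

# Example (relies on word splitting, so paths must not contain spaces):
#   docker compose $(compose_args mosquitto "$base_file") up -d
```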

For CHELSTY Home Assistant infrastructure, host-local configuration is the authority for runtime identity, secrets, and local device endpoints:

Home Assistant config: /opt/homelab/config/homeassistant
Zigbee2MQTT config: /opt/homelab/config/zigbee2mqtt
Mosquitto config: /opt/homelab/config/mosquitto

CHELSTY services must not require SATURN, VPS, or Forgejo to be reachable after deployment has completed. Docker Compose definitions can still come from Git, but Home Assistant automation, Zigbee control, and MQTT messaging must continue locally while LTE or Tailscale connectivity is unavailable.

Exposure Classes

Service inventory may declare one of these exposure classes:

local-only: bind only to host, LAN, or container networks. This is the default for Zigbee2MQTT and Mosquitto.
tailscale-internal: reachable over Tailscale only. This is appropriate for Home Assistant remote administration.
public: reachable from the public internet through a deliberate ingress path, normally the VPS edge role.

Public exposure is not implied by a service existing in Git. It must be explicit in host inventory and ingress configuration.
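
A validation step for the three classes above could look like this. The function name validate_exposure is hypothetical, and the sketch deliberately applies no default: an empty or unknown value is rejected so that exposure stays explicit.

```shell
# Hypothetical validator: accept only the exposure classes listed above.
validate_exposure() {
  case "$1" in
    local-only|tailscale-internal|public) return 0 ;;
    *) echo "invalid exposure class: '$1'" >&2; return 1 ;;
  esac
}

# Example: validate_exposure tailscale-internal || exit 1
```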

CHELSTY Home Automation Deployment Notes

CHELSTY remains a Docker Compose execution node. No Kubernetes, Helm, Ansible, or additional orchestration layer is required for Home Assistant infrastructure.

The SLZB-06U coordinator is network-connected over Ethernet or WiFi. Compose files and host overrides should configure Zigbee2MQTT for a TCP/network coordinator endpoint, not a USB serial device. Avoid /dev/ttyUSB0 mappings.
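
Zigbee2MQTT takes a network coordinator as a tcp://host:port value in its serial port setting. The check below is a sketch that a validate step might apply to that value; the function name is hypothetical and the address in the example is a placeholder, not the real coordinator endpoint.

```shell
# Hypothetical check: succeed only for a tcp://host:port endpoint,
# so USB device paths such as /dev/ttyUSB0 are rejected.
is_network_coordinator() {
  case "$1" in
    tcp://?*:?*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Example (placeholder address): is_network_coordinator tcp://192.0.2.10:6638
```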

Runtime paths follow the standard layout:

/opt/homelab/data/homeassistant
/opt/homelab/config/homeassistant
/opt/homelab/logs/homeassistant
/opt/homelab/data/zigbee2mqtt
/opt/homelab/config/zigbee2mqtt
/opt/homelab/logs/zigbee2mqtt
/opt/homelab/data/mosquitto
/opt/homelab/config/mosquitto
/opt/homelab/logs/mosquitto
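
The layout above follows one pattern per service, which a small helper can generate. The name service_dirs and the HOMELAB_ROOT override are assumptions for illustration.

```shell
# Hypothetical helper: print the data, config, and logs directories
# for one service under the standard root.
service_dirs() (
  root="${HOMELAB_ROOT:-/opt/homelab}"
  for kind in data config logs; do
    echo "$root/$kind/$1"
  done
)

# Example: service_dirs zigbee2mqtt | xargs mkdir -p
```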

Recommended backup coverage:

Home Assistant config and persistent data before upgrades or major integration changes.
Zigbee2MQTT config, database, coordinator backup files, and Zigbee network key material.
SLZB-06U firmware version, exported configuration, network address reservation, and coordinator state.
Mosquitto config, ACL/password files, persistence data, and bridge configuration if enabled.

Secrets Management

Do NOT commit secrets to Git.
Secrets should be placed in /opt/homelab/config/<service>/.env on the target host.
The deployment script should ensure Docker Compose loads this file when it starts the service.
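
A simple guard against committed secrets is to scan the list of tracked files for .env entries. The filter below is a sketch with a hypothetical name; it reads paths on stdin and matches only files named exactly .env.

```shell
# Hypothetical filter: print every path whose file name is exactly ".env".
# Exits non-zero when no such path is found (grep's normal behavior).
tracked_env_files() {
  grep -E '(^|/)\.env$'
}

# Example: fail a pre-commit or CI check if any .env file is tracked.
#   if git ls-files | tracked_env_files; then
#     echo "secrets file tracked in Git" >&2
#     exit 1
#   fi
```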
