# Deployment Conventions
This document describes the GitOps-lite deployment process for the homelab.
## Principles
1. **Git as Source of Truth**: All infrastructure definitions (Docker Compose, configurations) are stored in Git.
2. **Unidirectional Flow**: Changes flow from **SATURN** (commit node) to execution nodes.
3. **Lightweight**: No complex orchestrators (no Kubernetes). Use `docker compose` and simple shell scripts.
4. **Tailscale Mesh**: All hosts are connected via Tailscale, allowing secure communication without public port exposure.
5. **Host Autonomy**: Services that must operate during WAN or Git outages keep their runtime dependencies on the execution node or local LAN.
## Staged Deployment Framework
The homelab uses a modular, staged deployment framework whose entrypoint is `scripts/deploy/deploy.sh`. The script is resumable, stage-aware, and observable; its core logic is split into libraries under `scripts/lib/`.
### Runtime Architecture
The runtime consists of:
- `deploy.sh`: Orchestration entrypoint.
- `lib/log.sh`: Logging and structured output.
- `lib/state.sh`: Deployment state tracking and stage persistence.
- `lib/inventory.sh`: Host and service discovery, using Python-based YAML parsing.
- `lib/compose.sh`: Docker Compose operations.
- `lib/diagnostics.sh`: Post-failure analysis and summary generation.
### Deployment Stages
1. **prepare**: Pulls the latest changes from Git, validates inventory, and prepares the local environment. It is tolerant of network failures to support intermittently connected nodes like CHELSTY.
2. **validate**: Ensures all required service definitions and metadata are present.
3. **deploy**: Executes `docker compose` commands for all assigned services. Supports `.env` files and `docker-compose.override.yml` under `/opt/homelab/config/<service>/`.
4. **verify**: Executes service-specific `healthcheck.sh` scripts or checks container status.
5. **diagnose**: Automatically triggered on failure; collects container status and logs for troubleshooting.
6. **complete**: Finalizes the deployment and marks the state as finished.
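
The stage sequence above can be sketched as a small driver loop. This is a minimal illustration, not the actual `deploy.sh`: it assumes each stage is implemented as a shell function named `stage_<name>` (a hypothetical naming convention) and that `diagnose` runs only when another stage fails.

```shell
#!/bin/sh
# Hypothetical sketch: run the stages in order, trigger diagnose on failure.
# The stage_<name> functions are assumed to be defined elsewhere.

STAGES="prepare validate deploy verify complete"

run_stages() {
    for stage in $STAGES; do
        echo "==> stage: $stage"
        if ! "stage_$stage"; then
            echo "stage '$stage' failed; running diagnose" >&2
            # Diagnostics are best-effort and must not mask the failure.
            stage_diagnose || true
            return 1
        fi
    done
    return 0
}
```

Keeping `diagnose` out of the ordered list means a successful run never pays for log collection, while any failing stage still produces diagnostics before the script exits non-zero.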
### State Tracking and Logging
- **State**: Local node state is tracked in `/opt/homelab/state/deploy/current_stage`. The last successfully processed service in the `deploy` stage is tracked in `last_service` to support granular resumption.
- **Logs**: Detailed execution logs are stored in `/opt/homelab/logs/deploy/deploy_<timestamp>.log`. Structured log entries prefixed with `[STRUCT]` provide machine-parseable event data.
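
A minimal sketch of how state files and structured log lines might be written. The `[STRUCT]` prefix and the file names `current_stage` and `last_service` come from this document; the `key=value` layout of the log line, the function names, the `STATE_DIR` variable, and the assumption that `last_service` lives in the same directory as `current_stage` are illustrative only and may differ from the real `lib/log.sh` and `lib/state.sh`.

```shell
# Hypothetical helpers; the real lib/log.sh and lib/state.sh may differ.
STATE_DIR="${STATE_DIR:-/opt/homelab/state/deploy}"

# Emit one machine-parseable line. The key=value layout is an assumption;
# the document only specifies the [STRUCT] prefix.
log_struct() {
    event="$1"; shift
    printf '[STRUCT] ts=%s event=%s' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$event"
    for kv in "$@"; do
        printf ' %s' "$kv"
    done
    printf '\n'
}

# Record the stage that is about to run.
set_stage() {
    mkdir -p "$STATE_DIR"
    printf '%s\n' "$1" > "$STATE_DIR/current_stage"
    log_struct stage_start "stage=$1"
}

# Record the last service that deployed successfully.
set_last_service() {
    mkdir -p "$STATE_DIR"
    printf '%s\n' "$1" > "$STATE_DIR/last_service"
    log_struct service_done "service=$1"
}
```

With a layout like this, a plain `grep '^\[STRUCT\]'` over a log file is enough to extract the event stream.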
### Resume Semantics
If a deployment is interrupted (e.g., due to LTE disconnect on CHELSTY):
1. Rerun the script with the `--resume` flag: `scripts/deploy/deploy.sh --resume`.
2. The script finds the first stage without a completion marker (`/opt/homelab/state/deploy/stage_<name>_complete`) and continues from that stage.
3. In the `deploy` stage, it resumes from the first service that did not complete successfully and skips services that are already up.
4. Repeated `--resume` runs are safe and idempotent: completed stages are not re-executed. Omitting `--resume` clears the state and starts a fresh run.
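
The resume rules above can be sketched with marker files. The marker path pattern is taken from this document; the function names, the `STATE_DIR` variable, and the way the service list is passed in are hypothetical and only illustrate the idea.

```shell
# Hypothetical sketch of marker-based resume; not the actual deploy.sh code.
STATE_DIR="${STATE_DIR:-/opt/homelab/state/deploy}"
STAGES="prepare validate deploy verify complete"

# Mark a stage as finished by creating its marker file.
mark_stage_complete() {
    mkdir -p "$STATE_DIR"
    : > "$STATE_DIR/stage_${1}_complete"
}

# Print the first stage without a marker; print nothing if all are done.
first_incomplete_stage() {
    for stage in $STAGES; do
        if [ ! -e "$STATE_DIR/stage_${stage}_complete" ]; then
            printf '%s\n' "$stage"
            return 0
        fi
    done
}

# A run without --resume starts fresh, which clears all recorded state.
clear_state() {
    rm -f "$STATE_DIR"/stage_*_complete \
          "$STATE_DIR/current_stage" "$STATE_DIR/last_service"
}

# Print the services that still need deploying. The first argument is the
# content of last_service (empty on a fresh run); the rest is the ordered
# service list. If the name is not in the list, nothing is printed; a real
# implementation should probably treat that as an error.
remaining_services() {
    last="$1"; shift
    found=0
    [ -n "$last" ] || found=1   # fresh run: nothing to skip
    for svc in "$@"; do
        if [ "$found" -eq 1 ]; then
            printf '%s\n' "$svc"
        elif [ "$svc" = "$last" ]; then
            found=1
        fi
    done
}
```

Because the markers are plain files, an interrupted run leaves an unambiguous record: either a stage's marker exists or it does not, which is what makes repeated `--resume` runs idempotent.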
### Diagnostics and Troubleshooting
The runtime is designed to fail predictably and provide immediate feedback:
- **Automatic Diagnostics**: If any stage fails, `collect_diagnostics` is triggered to capture system state and container logs into `/opt/homelab/logs/deploy/diagnostics_<timestamp>.txt`.
- **Deployment Summary**: Every run concludes with a concise summary showing the host status, last stage reached, and log locations.
- **Offline Resilience**: The `prepare` stage handles `git pull` failures gracefully, allowing deployment from local cache during network instability.
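
One way the offline-tolerant update in `prepare` could look is sketched below. The function name and its three result strings are hypothetical. The sketch uses only standard Git commands (`git -C`, `pull --ff-only`, `rev-parse --verify HEAD`) and does not add a timeout, which a real implementation on a slow LTE link would likely want.

```shell
# Hypothetical sketch of an offline-tolerant update step for the prepare stage.
# Argument: path to the local checkout of the homelab repository.
update_repo() {
    repo_dir="$1"
    if git -C "$repo_dir" pull --ff-only >/dev/null 2>&1; then
        echo "updated"
    elif git -C "$repo_dir" rev-parse --verify HEAD >/dev/null 2>&1; then
        # The pull failed (for example, LTE is down) but a local checkout
        # exists, so deployment can continue from the cached definitions.
        echo "offline-cache"
    else
        echo "no usable checkout at $repo_dir" >&2
        return 1
    fi
}
```

Only the absence of any usable checkout is treated as fatal; a failed pull on its own downgrades the run to the last known-good commit.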
### Operational Semantics
Deployment is **hybrid**:
- **SATURN** acts as the orchestrator and source of truth.
- **Nodes** execute the deployment locally using the `deploy.sh` script.
- A human operator must trigger and confirm each deployment.
### Recovery Workflow
If a deployment fails:
1. Run `deploy.sh --stage diagnose` to identify the issue.
2. Use the `recover-node` AI prompt to analyze logs and get recommendations.
3. Fix the issue (e.g., update a secret in `.env`) and run `deploy.sh --resume`.
## Onboarding New Nodes
Refer to `inventory/templates/how_to_add_new_node.yaml` for a detailed guide on adding new hardware to the mesh. The general flow is:
1. Define node in `hosts/` and `inventory/topology.yaml` on SATURN.
2. Bootstrap the node (Docker, Tailscale, Git).
3. Run the staged deployment framework starting with `prepare`.
## Host-Local Overrides
If a service requires host-specific configuration (e.g., unique device paths for GPUs on SOLARIA):
1. Create a `docker-compose.override.yml` in `/opt/homelab/config/<service>/`.
2. The deployment script should include this override if it exists.
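
The override rule in the two steps above can be sketched as a helper that builds the `-f` arguments for `docker compose`. The function name, the `CONFIG_ROOT` variable, and passing the Git-managed Compose file as an argument are assumptions for illustration; this document does not specify where service definitions live in the repository.

```shell
# Hypothetical sketch: build the -f arguments for one service.
CONFIG_ROOT="${CONFIG_ROOT:-/opt/homelab/config}"

# Arguments: service name, path to the Git-managed Compose file.
# Paths containing spaces are not supported by this simple version.
compose_args() {
    service="$1"
    base_file="$2"
    override="$CONFIG_ROOT/$service/docker-compose.override.yml"

    args="-f $base_file"
    if [ -f "$override" ]; then
        # Later -f files take precedence, so the host-local override wins.
        args="$args -f $override"
    fi
    printf '%s\n' "$args"
}
```

A caller would then run something like `docker compose $(compose_args <service> <compose-file>) up -d`, leaving the substitution unquoted so that the arguments are split into separate words.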

For CHELSTY Home Assistant infrastructure, host-local configuration is the
authority for runtime identity, secrets, and local device endpoints:
- Home Assistant config: `/opt/homelab/config/homeassistant`
- Zigbee2MQTT config: `/opt/homelab/config/zigbee2mqtt`
- Mosquitto config: `/opt/homelab/config/mosquitto`

CHELSTY services must not require SATURN, VPS, or Forgejo to be reachable after
deployment has completed. Docker Compose definitions can still come from Git,
but Home Assistant automation, Zigbee control, and MQTT messaging must continue
locally while LTE or Tailscale connectivity is unavailable.
## Exposure Classes
Service inventory may declare one of these exposure classes:
- `local-only`: bind only to host, LAN, or container networks. This is the default for Zigbee2MQTT and Mosquitto.
- `tailscale-internal`: reachable over Tailscale only. This is appropriate for Home Assistant remote administration.
- `public`: reachable from the public internet through a deliberate ingress path, normally the VPS edge role.

Public exposure is not implied by a service existing in Git. It must be explicit
in host inventory and ingress configuration.
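
A validation step for these classes could look like the sketch below. The three class names are taken from the list above; the function names and the `yes`/`no` ingress flag are hypothetical stand-ins for a real inventory lookup.

```shell
# Hypothetical validator for the exposure classes listed above.
is_valid_exposure() {
    case "$1" in
        local-only|tailscale-internal|public) return 0 ;;
        *) return 1 ;;
    esac
}

# Reject "public" unless the caller confirms that an ingress entry exists.
# The second argument ("yes" or "no") stands in for a real inventory lookup.
check_exposure() {
    class="$1"
    has_ingress="${2:-no}"
    if ! is_valid_exposure "$class"; then
        echo "unknown exposure class: $class" >&2
        return 2
    fi
    if [ "$class" = "public" ] && [ "$has_ingress" != "yes" ]; then
        echo "public exposure requires explicit ingress configuration" >&2
        return 1
    fi
    return 0
}
```

Running a check like this in the `validate` stage would turn an accidental `public` declaration into a failed deployment rather than an open port.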
## CHELSTY Home Automation Deployment Notes
CHELSTY remains a Docker Compose execution node. No Kubernetes, Helm, Ansible,
or additional orchestration layer is required for Home Assistant infrastructure.

The SLZB-06U coordinator is network-connected over Ethernet or WiFi. Compose
files and host overrides should configure Zigbee2MQTT for a TCP/network
coordinator endpoint, not a USB serial device. Avoid `/dev/ttyUSB0` mappings.
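
A simple lint can catch accidental USB serial mappings before deployment. The sketch below is an illustration only: the function names are hypothetical, and matching `/dev/ttyACM*` in addition to `/dev/ttyUSB*` is an added assumption that goes beyond what this document states.

```shell
# Hypothetical lint: detect USB serial device mappings in a Compose file.
# Matching ttyACM as well as ttyUSB is an assumption, not a documented rule.
has_usb_serial_mapping() {
    grep -Eq '/dev/tty(USB|ACM)[0-9]+' "$1"
}

# Return 1 if the file maps a USB serial device, 2 if the file is missing.
check_network_coordinator() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "$file: not found" >&2
        return 2
    fi
    if has_usb_serial_mapping "$file"; then
        echo "$file: USB serial mapping found; use a network coordinator endpoint" >&2
        return 1
    fi
    return 0
}
```

The check would apply equally to the Git-managed Compose file and to any host-local `docker-compose.override.yml`.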
Runtime paths follow the standard layout:
- `/opt/homelab/data/homeassistant`
- `/opt/homelab/config/homeassistant`
- `/opt/homelab/logs/homeassistant`
- `/opt/homelab/data/zigbee2mqtt`
- `/opt/homelab/config/zigbee2mqtt`
- `/opt/homelab/logs/zigbee2mqtt`
- `/opt/homelab/data/mosquitto`
- `/opt/homelab/config/mosquitto`
- `/opt/homelab/logs/mosquitto`
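
The layout above follows one pattern, `/opt/homelab/<kind>/<service>`, which can be sketched as a helper that prints or creates the three directories for a service. The function names and the `HOMELAB_ROOT` variable are hypothetical.

```shell
# Hypothetical helpers for the standard per-service directory layout.
HOMELAB_ROOT="${HOMELAB_ROOT:-/opt/homelab}"

# Print the data, config, and logs directories for a service.
service_dirs() {
    for kind in data config logs; do
        printf '%s/%s/%s\n' "$HOMELAB_ROOT" "$kind" "$1"
    done
}

# Create the directories if they do not exist yet.
ensure_service_dirs() {
    service_dirs "$1" | while IFS= read -r dir; do
        mkdir -p "$dir"
    done
}
```

For example, `ensure_service_dirs zigbee2mqtt` would create the three `zigbee2mqtt` paths listed above.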

Recommended backup coverage:
- Home Assistant config and persistent data before upgrades or major integration changes.
- Zigbee2MQTT config, database, coordinator backup files, and Zigbee network key material.
- SLZB-06U firmware version, exported configuration, network address reservation, and coordinator state.
- Mosquitto config, ACL/password files, persistence data, and bridge configuration if enabled.
## Secrets Management
- **Do NOT commit secrets to Git.**
- Secrets should be placed in `/opt/homelab/config/<service>/.env` on the target host.
- The deployment script should ensure that Docker Compose loads this file.
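
A minimal sketch of how the deployment script might pass a host-local `.env` file to Docker Compose. The `.env` location comes from this document and `--env-file` is a standard `docker compose` option; the function name, the `CONFIG_ROOT` variable, and the `chmod 600` step are assumptions, with the permission change being an extra precaution that this document does not require.

```shell
# Hypothetical sketch: locate and protect a service's host-local .env file.
CONFIG_ROOT="${CONFIG_ROOT:-/opt/homelab/config}"

# Print "--env-file <path>" if the service has a .env file, after restricting
# the file to its owner. Prints nothing when there is no .env file.
env_file_args() {
    env_file="$CONFIG_ROOT/$1/.env"
    if [ -f "$env_file" ]; then
        chmod 600 "$env_file"
        printf '%s %s\n' "--env-file" "$env_file"
    fi
}
```

A caller could then run `docker compose $(env_file_args <service>) -f <compose-file> up -d`; when no `.env` exists the substitution expands to nothing and Compose falls back to its defaults.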