The homelab uses a modularized staged deployment framework located at `scripts/deploy/deploy.sh`. This script is designed to be resumable, stage-aware, and observable, with core logic split into maintainable libraries in `scripts/lib/`.
### Runtime Architecture
The runtime consists of:
-`deploy.sh`: Orchestration entrypoint.
-`lib/log.sh`: Logging and structured output.
-`lib/state.sh`: Deployment state tracking and stage persistence.
-`lib/inventory.sh`: Reliable host and service discovery (Python-based YAML parsing).
-`lib/compose.sh`: Docker Compose operations.
-`lib/diagnostics.sh`: Post-failure analysis and summary generation.
1.**prepare**: Pulls the latest changes from Git, validates inventory, and prepares the local environment. It is tolerant of network failures to support intermittently connected nodes like CHELSTY.
2.**validate**: Ensures all required service definitions and metadata are present.
3.**deploy**: Executes `docker compose` commands for all assigned services. Supports `.env` files and `docker-compose.override.yml` under `/opt/homelab/config/<service>/`.
4.**verify**: Executes service-specific `healthcheck.sh` scripts or checks container status.
5.**diagnose**: Automatically triggered on failure; collects container status and logs for troubleshooting.
6.**complete**: Finalizes the deployment and marks the state as finished.
-**State**: Local node state is tracked in `/opt/homelab/state/deploy/current_stage`. The last successfully processed service in the `deploy` stage is tracked in `last_service` to support granular resumption.
-**Logs**: Detailed execution logs are stored in `/opt/homelab/logs/deploy/deploy_<timestamp>.log`. Structured log entries prefixed with `[STRUCT]` provide machine-parseable event data.
### Resume Semantics
If a deployment is interrupted (e.g., due to LTE disconnect on CHELSTY):
1. Rerun the script with the `--resume` flag: `scripts/deploy/deploy.sh --resume`.
2. The script identifies the last incomplete stage using deterministic markers (`/opt/homelab/state/deploy/stage_<name>_complete`) and continues from the exact failure point.
3. In the `deploy` stage, it specifically resumes from the first service that was not successfully completed, skipping those already up.
4. Repeated runs are safe and idempotent; completed stages are not re-executed unless the resume flag is omitted (which clears state for a fresh run).
### Diagnostics and Troubleshooting
The runtime is designed to fail predictably and provide immediate feedback:
- **Automatic Diagnostics**: If any stage fails, `collect_diagnostics` is triggered to capture system state and container logs into `/opt/homelab/logs/deploy/diagnostics_<timestamp>.txt`.
- **Deployment Summary**: Every run concludes with a concise summary showing the host status, last stage reached, and log locations.
- **Offline Resilience**: The `prepare` stage handles `git pull` failures gracefully, allowing deployment from local cache during network instability.