Recovery from bad merge of task/observer-poison-quarantine (c255a02)
which carried false deletes from a stale branch base. Re-applies only
the genuine observer changes on top of correct master state.
When an event file fails to parse (malformed JSON, truncated, corrupted),
the observer previously kept retrying on every cycle while the node's
checkpoint stayed pinned — all subsequent good events for that node lost.
Now: first parse failure -> atomic os.replace to STATE_DIR/observer_failed_events/<node>/
with collision handling. Checkpoint advances, downstream events flow.
Move failures are logged but don't crash the loop.
Complementary to the atomic_write_json fix on state files; this addresses
the same race-pattern on event files instead.
Regression test asserts: bad event quarantined to failed_events dir,
removed from hot path, subsequent good event processed (node online),
checkpoint moves to good event.
Two root causes for stale "active" incidents on the dashboard:
1. TypeError bug in _prune_stale_world: last_occurrence / resolved_at
can be an ISO-8601 string (stability-agent via events.py) or a Unix
int (node-agent). The previous session's auto-resolve did plain
`time.time() - last_occ` which raises TypeError for strings,
silently preventing _save_world() from being called and leaving
incidents perpetually "active" on disk.
Fix: add _parse_ts(ts) -> float that handles int, float, and
ISO-8601 strings uniformly. All timestamp arithmetic now goes through
it; returns 0.0 on None / garbage to keep comparisons safe.
2. Orphaned active incidents: _resolve_incident clears service["incident_id"]
and marks the incident "resolved" in memory, but if incidents.json was
truncated mid-write (pre-atomic-write era), the observer loaded it at
next startup with status="active" and no service entry pointing to it.
No code ever touched these orphans again.
Fix: _prune_stale_world now runs two cleanup passes each cycle:
- Case 1 (healthy-linked): service.status=="healthy" AND incident_id
still set → resolve immediately (service cannot have active incident)
- Case 2 (orphaned): active incident with no service link AND
last_occurrence > 5 min ago → resolve (5-min guard for creation race)
Both cases are wrapped in try/except so a bug here never crashes the
observer loop or blocks _save_world.
Also fixes the 7-day stale-incident prune to use _parse_ts so
ISO-string resolved_at values are handled correctly.
3. Operator UI: current_incidents() now filters to status=="active" only.
Resolved incidents were previously included in the /incidents endpoint,
making the dashboard show a wall of historical records as if active.
Nocturnal job investigation: _cleanup_control_plane_fs in node-agent runs
every 60s on VPS (not midnight-specific); it reads observer_checkpoint.json
(now written atomically) and deletes old event files. No non-atomic writes
found. Midnight clustering was likely external (logrotate / OS flush);
the supervisor's resilient loader already handles such transient issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All JSON state writes (services.json, nodes.json, incidents.json,
deployments.json, runtime-summary.json, observer_checkpoint.json) now use
_atomic_write_json: write to a .tmp sibling, fsync, then os.replace.
This eliminates the truncated-write window that caused supervisors
reading mid-write files to see empty/partial JSON.
Also adds auto-resolution of phantom active incidents: if a service
reports status=healthy and its incident's last_occurrence is >30 min old,
the incident is resolved in _prune_stale_world. This clears false active
incidents accumulated from previous race-condition reads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
node-agent: use com.docker.compose.service label as canonical name
- Add _canonical_container_name() method: prefers compose label,
falls back to hash-prefix-stripped c.name
- Replace bare c.name usage in check_containers()
- Skip 'created'-state containers (Docker stale-state artifacts)
observer: prune hash-prefixed ghost keys in _prune_stale_world()
- Each reconcile cycle removes service keys matching <node>/<12hex>_<name>
- Acts as safety net for entries already in services.json + future slippage
control-plane/docker-compose.yml already has explicit container_name on
all four services — no change needed there.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
service_healthy is a positive health confirmation — if the service had
an active incident (e.g. from earlier service_unhealthy events), that
incident should be resolved when the service is confirmed healthy.
Previously only service_recovered resolved incidents; service_healthy
set status=healthy but left incidents open, keeping status='degraded'.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- node_agent: emit service_healthy for all running managed containers so
observer populates services.json (previously empty → supervisor flooded
action queue with missing_service redeploys for healthy services)
- node_agent: VPS-only _check_control_plane_health() probes the HTTP
endpoint to emit service_healthy/unhealthy for the 'control-plane' logical
service (multi-container stack, container names don't match service name)
- node_agent: fix _cleanup_control_plane_fs() to read new node_checkpoints
format from observer checkpoint (was reading old last_processed_file key,
always found nothing, never cleaned up old events)
- observer: handle service_healthy event type → sets service status healthy
without resolving incidents (unlike service_recovered which also resolves)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The old mechanism tracked a single 'last_processed_file' and used sorted
filename order to find new events. Remote nodes ship events into
subdirectories (events/piha/, events/chelsty-infra/) that sort
alphabetically BEFORE the VPS directory (events/vps/). Once the
checkpoint pointed to a vps/ file, all piha/ and chelsty-infra/ events
were silently skipped forever.
New mechanism:
- node_checkpoints: {node_dir: last_processed_path}
- Each node directory has its own independent cursor
- New events = files whose path > that node's checkpoint
- Backward-compatible: old 'last_processed_file' is migrated by extracting
the node dir from the path on first load
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of stale data:
- node_agent.py falls back to socket.gethostname() when NODE_NAME is unset.
Inside a Docker container this returns the 12-char container ID (e.g.
'be17cb6eb0f6'), not the host name. Observer ingested those events and
created ghost entries in world/nodes.json that never expired.
observer.py:
- _prune_stale_world(): removes node/service/incident entries for nodes absent
from topology inventory; called on every run_once() cycle (both new-events
and idle paths). Resolved incidents older than 7 days are also aged out.
- _save_world(): now writes node_count and service_count to runtime-summary.json
so the Dashboard's System Overview cards show real numbers instead of undefined.
operator_ui.py:
- current_nodes/services/deployments/incidents(): the observer stores world state
as keyed dicts; the frontend calls .map() which requires an array. All four
functions now convert the dict to a properly-shaped list. Each item has the
fields the Nodes, Services, Topology, Deployments, and Correlation views expect
(hostname, health, capabilities, desired_state, dependencies, etc.).
- current_incidents(): synthesises a human-readable 'message' field from node +
service + trigger_type (observer does not store one; dashboard showed undefined).
- current_events(): adds a 24 h time filter (EVENTS_MAX_AGE_HOURS env var,
default 24). Without this, every event file ever written was returned,
including events from ghost-node deploys.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- observer: store trigger_type on incidents for supervisor routing
- supervisor: route containers_not_running/mqtt_unreachable to container_restart instead of redeploy
- supervisor: fix node alias normalization via NODE_ALIAS_MAP
- supervisor: fix pending action dedup (scan by content not filename)
- executor: implement container_restart via SSH docker restart with retry
- control-plane override: configure NODE_ALIAS_MAP for production
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>