Evening Retrospective — 2026-05-12

Summary

Today focused on stabilizing cleanup/reconciliation and fixing several runner/router edge cases that caused false failures and retries against dead models. The high-priority cleanup timeout fix deployed yesterday is still holding. Two owner-blocked items remain (#3110 and internal:149337). Routing and task dispatch stayed healthy overall, with low failure counts.

What We Did

Deployed fixes for misclassified runner failures and for persistent model cooldown handling (see today's commits). These reduce false-failure noise and ensure that model-not-found events now trigger a persistent backoff rather than retry loops.
Verified the closed-issue reconciliation timeout increase and deduplication (the #3112 fix) — no more timeout WARNs during cleanup.
Resolved a classification bug where kimi/claude exit-1 with terminal_reason:completed was treated as a failure; runs are now recognized as success when NDJSON indicates completion.
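
The classification order described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the function name `classify_run`, the one-JSON-object-per-line envelope shape, and the rule that any non-"completed" terminal_reason counts as a failure are all assumptions made for the example.

```python
import json
from typing import Iterable, Optional


def classify_run(exit_code: int, ndjson_lines: Iterable[str]) -> str:
    """Return "success" or "failure" for a finished agent run.

    Hypothetical helper: the envelope shape ({"terminal_reason": ...}) is an
    assumption for illustration, not the runner's real schema.
    """
    terminal_reason: Optional[str] = None
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate garbled or partial lines
        if isinstance(event, dict) and "terminal_reason" in event:
            terminal_reason = event["terminal_reason"]  # last envelope wins

    # Check the terminal envelope first; the exit code is only a fallback.
    if terminal_reason == "completed":
        return "success"
    if terminal_reason is not None:
        return "failure"  # assumption: any other terminal reason is a failure
    return "success" if exit_code == 0 else "failure"
```

With this ordering, a run that exits 1 but emits a completed envelope is reported as a success, while a run with no envelope at all still falls back to the exit code.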

Recent Commits (selected)

Hash	Message
`eef3867c`	fix(router): apply immediate 7-day cooldown for permanently-gone models (#3121)
`d938323d`	bug(runner): kimi/claude exit-1 with terminal_reason:completed misclassified as failure (garbled error messages) (#3120)

These two commits address two of the most visible operational pain points: (1) opencode dispatching to dead models and (2) successful runs from some agents being misclassified as failures.
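
As a rough sketch of the cooldown idea in `eef3867c`: once a model is found to be permanently gone, the router records a 7-day cooldown and skips that model when choosing a dispatch target. The class and method names below are hypothetical, and the state is kept in memory only for brevity; the real router presumably persists it so the cooldown survives restarts.

```python
import time
from typing import Dict, Iterable, Optional

SEVEN_DAYS_S = 7 * 24 * 60 * 60  # cooldown length taken from the commit message


class ModelCooldowns:
    """Hypothetical bookkeeping for model cooldowns (in-memory sketch)."""

    def __init__(self) -> None:
        self._until: Dict[str, float] = {}  # model name -> cooldown expiry (epoch seconds)

    def mark_permanently_gone(self, model: str, now: Optional[float] = None) -> None:
        """Apply the full cooldown immediately instead of retrying."""
        now = time.time() if now is None else now
        self._until[model] = now + SEVEN_DAYS_S

    def is_available(self, model: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now >= self._until.get(model, 0.0)

    def pick(self, candidates: Iterable[str], now: Optional[float] = None) -> Optional[str]:
        """Return the first candidate that is not cooling down, or None."""
        for model in candidates:
            if self.is_available(model, now):
                return model
        return None
```

The `now` parameter exists only to make the behaviour easy to test without waiting on the clock.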

Operational Metrics (last 12h)

Task activity: healthy (dozens of dispatches, pushes, and PRs created).
Failures: very low. Two isolated failures were observed: one kimi/opus and one opencode:gpt-5.3 (the latter was recorded and cooled down as expected).
Error logs: brew stderr and orch.error.log show no new entries of concern.

What Completed (notable)

#3119 closed: misclassification causing false failures for kimi/claude fixed and merged.
#3118 closed: router fixes to avoid dispatching dead Copilot model aliases merged.
#3112 fix verified: increased reconciliation timeout and deduplication for closed-issue cleanup.

What Failed / Needs Attention

#3110 remains blocked: Claude 401 auth failures require owner-provided log lines and task IDs to triage. Agents cannot proceed without those artifacts.
internal:149337 (SSH signing / git fetch) still blocked awaiting owner action (ssh-agent keys or switch to HTTPS remote).

Root causes:

#3110: missing auth logs prevent reproducing the error; appears to be a credentials/session problem outside the agent's visibility.
internal:149337: transient SSH agent problems causing push/fetch failures — operator action required.

Routing Accuracy & Agent Health

Routing remains accurate: most tasks were distributed across claude, codex, opencode, kimi, and minimax as expected.
Silent or dead models are being handled by persistent model cooldowns. The opencode:gpt-5.3 failure correctly triggered a persistent cooldown instead of repeated retries.
No evidence of a routing bias or router LLM regressions in today's traces.

Performance & Bottlenecks

Cleanup listing timeouts were the primary bottleneck previously; #3112 increased the timeout and deduplicated fallback logic — the fix is holding.
No new API rate-limit storms or watchdog stalls observed in the last 12 hours.

Learnings / Patterns

Persistent model cooldowns are working as intended: when a model is genuinely unavailable, the system records a long backoff instead of looping on retried dispatches.
The agent's terminal state (the terminal_reason in the NDJSON envelope) must be checked before falling back to error classification; this prevents false failures.
When raising owner-blocked issues, include exact log lines (tail of orch.log and orch.error.log) and task_run IDs to enable agent triage.

Priorities for Tomorrow (morning review)

Owner action on #3110 — provide orch.log / orch.error.log lines (filter 401 / Invalid authentication) and the task IDs that observed Claude 401s.
Owner action on internal:149337 — confirm SSH agent keys are loaded (ssh-add -l) or change the worktree remote to HTTPS for the affected project.
Monitor opencode model cooldowns — if the same model repeatedly reappears after cooldown expiry, consider removing it from per-project config.
Continue monitoring for misclassified runner failures (similar to the terminal_reason:completed case) and add regression tests around NDJSON terminal_reason handling.
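
For the #3110 request above, the relevant log lines could be pulled with a small filter along these lines. This is only a sketch: the function name and the `tail` limit are illustrative, and the match patterns are simply the two strings named in the priority item (401 and "Invalid authentication").

```python
import re
from typing import Iterable, List

# Matches a standalone "401" or the phrase "Invalid authentication".
AUTH_FAILURE = re.compile(r"\b401\b|Invalid authentication", re.IGNORECASE)


def auth_failure_lines(lines: Iterable[str], tail: int = 200) -> List[str]:
    """Return the last `tail` log lines that look like auth failures."""
    matches = [ln.rstrip("\n") for ln in lines if AUTH_FAILURE.search(ln)]
    return matches[-tail:] if tail > 0 else []
```

It accepts any iterable of lines, so an open file handle for orch.log or orch.error.log can be passed directly, for example `with open("orch.log") as f: print("\n".join(auth_failure_lines(f)))`.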

Prepared by Orch automation (internal:149498).
