Evening Retrospective — 2026-05-16
Summary
Three reliability bugs identified in last night's retro and this morning's review were closed today:
- bec81470 / #3146 - fix(engine): closed-issue reconciliation now calls list_reconciliation_candidates() directly instead of the expensive list_all_tasks() (which was timing out at 30s every tick and falling back to cached tasks).
- e437f712 / #3145 - bug: opencode periodically dispatched to cooled/dead github-copilot/* models.
- ea8d421e / #3144 - fix(runner): auto-retry transient pass-store GPG decryption blockers so headless agent sessions don't permanently block when the GPG agent is unavailable.
Two releases were cut (v0.71.14 at 21:24 UTC, v0.71.15 at 21:31 UTC) but the running service is still on v0.71.13 — the cleanup-timeout fix is on disk but not yet deployed (see Priorities below).
What Was Accomplished
- Merged #3144, #3145, #3146 — addressing the three top-priority reliability issues outstanding at the start of the day.
- Reconciliation fix (#3146) replaces a known-slow listing path with a purpose-built candidate query that already worked as the fallback (count=194). This is the right fix per AGENTS.md: same generic mechanism, no special-casing.
- The morning review's #1 priority ("investigate and fix reconciliation timeout 3116/3117") was resolved by the day's last commit.
What Failed, Retried, Or Needed Intervention
1) Closed-issue reconciliation timeout still observable
The fix landed at 18:24 BRT but the running service is v0.71.13. Every ~50s in current logs:
WARN orch::engine::cleanup: timed out listing all tasks for closed-issue reconciliation timeout_secs=30
INFO orch::engine::cleanup: using fallback tasks for closed-issue reconciliation count=194
This is deployment lag, not a regression: run brew upgrade orch && orch service restart to pick up v0.71.15.
2) LLM routing budget exceeded on this task
internal:149727 (this retrospective) and internal:149723 both exceeded the 30s LLM budget and fell back to round-robin selection. internal:149727 was routed to kimi:opus via round-robin (index 3 of 6 agents). A slow tick of 55373ms was logged during dispatch — within the watchdog threshold (6 × tick_interval = 60s) but close. This is the intended cascade behavior (#3050).
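The fallback and the watchdog margin described above can be sketched as follows. This is a minimal illustration, not orch's actual code: the function names are hypothetical, the agent order is assumed to match the list under Routing Accuracy, "index 3" is assumed to be zero-based, and the 10s tick interval is inferred from "6 × tick_interval = 60s".

```python
from typing import Optional, Sequence, Tuple

# Inferred from the report: 6 x tick_interval = 60s, so tick_interval = 10s.
TICK_INTERVAL_SECS = 10
WATCHDOG_MULTIPLIER = 6


def round_robin(agents: Sequence[str], counter: int) -> str:
    """Pick an agent by position, wrapping around the list."""
    return agents[counter % len(agents)]


def choose_agent(
    llm_choice: Optional[str], agents: Sequence[str], counter: int
) -> Tuple[str, str]:
    """Return (agent, reason).

    llm_choice is None when the LLM router exceeded its budget; in that
    case the hypothetical fallback is plain round-robin.
    """
    if llm_choice is not None and llm_choice in agents:
        return llm_choice, "llm"
    return round_robin(agents, counter), "round_robin"


def watchdog_headroom_ms(elapsed_ms: int) -> int:
    """Milliseconds left before a tick of this length trips the watchdog."""
    threshold_ms = WATCHDOG_MULTIPLIER * TICK_INTERVAL_SECS * 1000
    return threshold_ms - elapsed_ms


# Assumed ordering, taken from the Routing Accuracy list.
agents = ["claude", "codex", "opencode", "kimi", "minimax", "glm"]
print(choose_agent(None, agents, counter=3))  # ('kimi', 'round_robin')
print(watchdog_headroom_ms(55373))            # 4627
```

Under these assumptions the slow tick left roughly 4.6s of headroom before the watchdog threshold, which is why it is described as within bounds but close.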
3) Carryover blockers (no progress, waiting on owner)
- internal:149337 is blocked by an SSH agent signing failure on git push (sign_and_send_pubkey: signing failed for ED25519 ... from agent: communication with agent failed). Owner action: fix SSH agent or switch remote to HTTPS.
- #3110 is the Claude 401 "Invalid authentication credentials" failure. The triage agent has explicitly requested log excerpts and per-task artifacts; no response yet.
4) Kimi review/run rate-limit pattern persists
kimi/opus recorded 69 failed and 24 rate_limit outcomes in the last 24h against 158 successes. The recent fix #3134 reduced false parse_error classification but did not address upstream rate limiting on the kimi provider. Tasks self-recover via fallback reviewers; no new issue warranted — the generic cooldown system is the right place for this.
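To put those counts in proportion, the shares can be worked out as below. This is a small sketch that assumes the three outcome categories are disjoint and together cover every kimi/opus run in the window, which the report does not state explicitly.

```python
# Counts quoted in the report for kimi/opus over the last 24h.
outcomes = {"success": 158, "failed": 69, "rate_limit": 24}


def outcome_shares(counts):
    """Return each outcome as a percentage of all runs, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_runs = sum(outcomes.values())
shares = outcome_shares(outcomes)
print(total_runs)  # 251
print(shares)      # {'success': 62.9, 'failed': 27.5, 'rate_limit': 9.6}
```

On that reading, a little over a third of runs did not succeed, with rate limiting accounting for about a tenth of all runs.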
Routing Accuracy
- Distribution remained healthy across all six agents (claude, codex, opencode, kimi, minimax, glm).
- opencode/github-copilot/gpt-5.3 recorded only 5 failures in 24h (down from prior days); the per-model cooldown for unavailable models is doing its job.
- Two "LLM routing budget exceeded" events in the visible log window, at the expected low rate and with correct fallback to round-robin.
- internal:149727 round-robin selection of kimi:opus for the evening retrospective is a known weak link given today's kimi:opus rate-limit volume, but it's the correct generic behavior and the task is in progress.
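A per-model cooldown of the kind credited above can be sketched generically, keyed by whatever model name the caller reports so that no model list lives in the selection logic. The class name, the 300s duration, and the in-memory storage are all assumptions made for illustration; the report does not describe how orch stores or sizes its cooldowns. The placeholder "provider-a/model-x" is not a real model.

```python
import time
from typing import Dict, List, Optional


class CooldownTracker:
    """Hypothetical in-memory tracker of models that recently failed."""

    def __init__(self, cooldown_secs: float = 300.0) -> None:
        self.cooldown_secs = cooldown_secs
        self._until: Dict[str, float] = {}  # model name -> time it becomes eligible

    def record_failure(self, model: str, now: Optional[float] = None) -> None:
        """Start (or restart) the cooldown for a model that just failed."""
        now = time.monotonic() if now is None else now
        self._until[model] = now + self.cooldown_secs

    def is_available(self, model: str, now: Optional[float] = None) -> bool:
        """A model is available if it never failed or its cooldown has elapsed."""
        now = time.monotonic() if now is None else now
        return now >= self._until.get(model, float("-inf"))

    def eligible(self, models: List[str], now: Optional[float] = None) -> List[str]:
        """Filter a candidate list down to models not currently cooling down."""
        return [m for m in models if self.is_available(m, now)]


tracker = CooldownTracker(cooldown_secs=300.0)
candidates = ["github-copilot/gpt-5.3", "provider-a/model-x"]
tracker.record_failure("github-copilot/gpt-5.3", now=1000.0)
print(tracker.eligible(candidates, now=1100.0))  # ['provider-a/model-x']
print(tracker.eligible(candidates, now=1300.0))  # both models again
```

Passing `now` explicitly keeps the sketch deterministic; a real scheduler would presumably rely on its own clock and persist the state across restarts.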
Performance / Bottlenecks
- One "slow tick elapsed_ms=55373" warning, caused by a 30s LLM budget timeout immediately followed by two synchronous worktree creations within the same tick. Acceptable but worth watching if it recurs.
- cleanup warnings recur every tick, pending the v0.71.15 deploy (see Priorities).
- No watchdog stalls, no fallback-loop patterns, no dispatch storms in the last 24h.
Learnings Captured Today
- The right fix for a slow listing path is to use a purpose-built query, not to extend the timeout. The reconciliation candidate listing was already implemented as the fallback — promoting it to the primary path is strictly an improvement.
- Per-model cooldowns plus generic recovery handle dead/unavailable copilot models without needing hardcoded model lists in router code (reinforces the AGENTS.md "NEVER hardcode model names" guidance: issue #3141 was fixed without adding any model strings to src/engine/router/).
- Pass-store credential resolution must fail-open under transient unavailability (GPG agent not responding in headless tmux), retry, and never permanently block a task. This is the same generic principle as agent cooldowns.
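The retry principle in the last learning can be illustrated with a bounded retry loop. This is a sketch under stated assumptions rather than the #3144 implementation: `TransientCredentialError` and `retry_transient` are hypothetical names, and the attempt count and exponential backoff are illustrative choices that the report does not specify.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientCredentialError(Exception):
    """Hypothetical marker for a failure expected to clear on its own,
    such as a GPG agent that is temporarily not responding."""


def retry_transient(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay_secs: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch, retrying transient failures with exponential backoff.

    The last failure is re-raised so the caller can reschedule the task
    instead of marking it permanently blocked.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientCredentialError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_secs * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Usage: a fake credential lookup that fails twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky_lookup() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientCredentialError("gpg agent unavailable")
    return "secret"


result = retry_transient(flaky_lookup, sleep=delays.append)
print(result, delays)  # secret [1.0, 2.0]
```

Injecting `sleep` keeps the example fast to run and makes the backoff schedule observable.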
Priorities For Tomorrow (Morning Review)
- Deploy v0.71.15 (brew update && brew upgrade orch && orch service restart): the closed-issue reconciliation fix is on disk but not yet running. Verify that the "cleanup: timed out listing all tasks" warnings stop after restart.
- Confirm no regression post-#3144: monitor for any new credential-decryption blocks; the auto-retry path should keep tasks moving even if GPG is temporarily unavailable.
- Confirm no regression post-#3145: check task_runs for any new opencode → github-copilot/* failures with "model not found" errors. Persistent per-model cooldown should keep them out of selection.
- Push owners on carryovers: request a concrete error window/log excerpt for #3110 (Claude 401) and confirm SSH agent fix for internal:149337. These remain the only blocked orch tasks not on the autonomous-fix path.
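The post-deploy check in the first priority can be done by counting the timeout warning in log lines captured after the restart. The helper below is a hypothetical sketch: it assumes the caller has already collected the relevant lines, since the report does not say where the service logs are stored or how they are timestamped. The sample lines are the two quoted earlier in this report.

```python
# Message text copied from the warning quoted in this report.
WARNING_MARKER = "timed out listing all tasks for closed-issue reconciliation"


def count_reconciliation_timeouts(log_lines):
    """Count log lines containing the reconciliation timeout warning."""
    return sum(1 for line in log_lines if WARNING_MARKER in line)


sample = [
    "WARN orch::engine::cleanup: timed out listing all tasks for "
    "closed-issue reconciliation timeout_secs=30",
    "INFO orch::engine::cleanup: using fallback tasks for "
    "closed-issue reconciliation count=194",
]
print(count_reconciliation_timeouts(sample))  # 1
```

A count of zero over several ticks after the restart would suggest the fix is live; any nonzero count would point to the service still running the old version.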
Issues Created
None tonight.
The day's three discovered problems were already tracked (#3140/#3141/#3142/#3143) and all closed. Remaining items are either deployment lag (no issue needed) or carryover waiting on owner input (existing issues already open).
Prepared by Orch automation (internal:149727).