Evening Retrospective — 2026-04-10
Summary
Day 4 of the reliability sprint closes with 43 commits, 20 issues closed, and zero open issues. Three audit trail bugs consumed most of the day's cycles. The pipeline is performing exceptionally well — 85% success rate on 494 task runs, throughput up ~30% vs yesterday.
Morning Priorities — Outcome
| Priority from morning review | Status |
|---|---|
| Upgrade CLI/service (0.60.x → 0.61.x boundary) | Unknown — not confirmed. Version may still be stale. |
| Verify #2317 silence detection fix | Done — opencode now has 25+ successes with minimax-m2.5-free; no 600s kills observed. |
| Investigate claude timeouts (5) | Unresolved — 5 new claude/sonnet timeouts today. Persistent pattern. |
| Investigate olm/gemma4 absence | Inconclusive — still absent from routing. Not configured or intentionally excluded. |
| Monitor throughput stability | OK — 30% throughput increase held, error rate proportionally lower. |
What Was Accomplished
Today's commits — grouped by theme
Audit trail integrity (3 major bugs, ~18 commits)
- 3d8b40a8 (closes #2394) prevent stale last_error from prior agent contaminating task_run audit
  - UPSERT logic carried the old last_error forward even when the new run succeeded, leaving the previous agent's error in the record.
  - Impact: audit trail showed the wrong agent paired with the wrong error, which makes post-mortem analysis unreliable.
- afbd2018 (closes #2394) preserve attempts counter in reset_failure_counters to prevent audit trail overwrites
  - Related: reset_failure_counters was clearing the attempts counter before recording the new run, making task_runs.attempts always show 1.
  - Impact: all task runs appeared to be first attempts even on retries.
- b422dad3 (closes #2392) ReviewDecision::Rerouted stops no_code_reroutes counter reset (infinite retry loop)
  - The Rerouted decision was missing from the "don't reset" list, causing the no_code_reroutes counter to reset on every re-route.
  - Impact: tasks with no PR and no commits would retry indefinitely, never hitting the block threshold.
- faa2970d test: regression tests for no_code_reroutes counter accumulation (#2392)
- 9ad9d33e (closes #2389) bug: 41 fire-and-forget store_set calls silently discard DB write failures
  - Massive cleanup: 41 sites across the codebase where store_set failures were ignored.
  - Pattern: tokio::spawn(async { store.set(...) }), where failures are silently swallowed by the spawned task.
  - Fixed by adding proper error handling or using a shared observe() pattern.
- 513eb318 (closes #2384) review.rs fire-and-forget store_set for push_failures reset causes counter drift
- 8c8381a4 (closes #2383) store.get() error defaults to needs_review instead of done
  - When store.get() failed, it was treated as "no PR found" → needs_review. Should be done (nothing to review).
  - Impact: tasks with transient DB errors would skip review instead of completing.
- 03974275 (closes #2357) run_agent_session ignores invocation.timeout_seconds; per-complexity timeout unimplementable
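The two #2394 fixes come down to one rule for updating a task_runs record: the error must come from the current run, and the attempts counter must carry forward. A minimal in-memory sketch of that rule, assuming a hypothetical TaskRun struct and record_run function (neither name comes from the codebase, and the real store is a database UPSERT rather than a HashMap):

```rust
use std::collections::HashMap;

/// Hypothetical shape of a task_runs row; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct TaskRun {
    pub agent: String,
    pub last_error: Option<String>,
    pub attempts: u32,
}

/// Upsert the outcome of one run for `task_id`.
/// `outcome` is Ok(()) on success or Err(message) on failure.
pub fn record_run(
    runs: &mut HashMap<u64, TaskRun>,
    task_id: u64,
    agent: &str,
    outcome: Result<(), String>,
) {
    // Carry the attempts counter forward instead of restarting at 1.
    let attempts = runs.get(&task_id).map_or(0, |r| r.attempts) + 1;
    runs.insert(
        task_id,
        TaskRun {
            agent: agent.to_string(),
            // The error always comes from this run, so a success clears
            // whatever the previous agent left behind.
            last_error: outcome.err(),
            attempts,
        },
    );
}
```

With this rule, a failed claude run followed by a successful codex run leaves the record showing codex, no error, and attempts of 2.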
Performance improvements
- bebc68d1 perf: parallelize fallback_is_pr_merged_by_branch in cleanup.rs (concurrent branch checks for worktree cleanup)
- 47a4311f perf: scan_comments re-introduces serial is_pull_request calls (regression from batch fix #2342, where serial calls were accidentally re-introduced; fixed)
- 5f11441b perf: batch is_pull_request checks in scan_mentions with join_all (from yesterday, carried forward)
Mention / command / review fixes
- 03225a9f (closes #2377) acknowledge_mention failure leaves mention unacknowledged (cursor advances prematurely)
- 36ba2a80 (closes #2376) advance cursor on FetchFailed/CollaboratorCheckFailed in scan_comments (infinite retry loop)
- e6504d11 fix: use expect() instead of unwrap_or(false) for is_pr_map lookup (#2378)
- b9bdfba0 (closes #2368) auto_unblock_blocked_tasks increments counter before status update
- 42b87fad refactor: unify scan_mentions and scan_commands into single scan_comments
- 3aa4ee0e (closes #2363) handle_slash_command returns bool that callers ignore (unreachable false branch)
- 5a3cfc58 fix(sync): only advance mention cursor on successful store write (#2362)
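Several of these fixes share one ordering rule: the cursor moves only after the write for that mention has succeeded. A rough sketch of that rule, assuming a hypothetical Cursor type and a write callback standing in for the store write or acknowledgement (the real scan_comments signature is not shown in this retrospective). It covers only the store-write case, not the FetchFailed/CollaboratorCheckFailed case from #2376, where the cursor is advanced deliberately.

```rust
/// Hypothetical scan state: the id of the last mention that was fully handled.
#[derive(Debug, Default)]
pub struct Cursor {
    pub last_handled: u64,
}

/// Process mention ids in ascending order, stopping at the first failed write.
/// `write` stands in for the store write and is an assumption.
pub fn process_mentions<F>(
    cursor: &mut Cursor,
    mentions: &[u64],
    mut write: F,
) -> Result<(), String>
where
    F: FnMut(u64) -> Result<(), String>,
{
    for &id in mentions {
        if id <= cursor.last_handled {
            continue; // already handled on an earlier scan
        }
        // Return the error before touching the cursor, so the failed
        // mention is picked up again on the next scan.
        write(id)?;
        cursor.last_handled = id;
    }
    Ok(())
}
```

If the write for mention 2 fails, the cursor stays at 1 and the next scan retries from mention 2 instead of skipping it.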
Push / merge reliability
- 4bf387f0 (closes #2382) pop stash and bail when rev-parse fails after git stash succeeds
- 56eff734 (closes #2335) recover push failures by rebasing on remote branch
Tests
- ff1b490d (closes #2370) test: add test coverage for review_poll.rs (19 tests across all 6 paths)
- 673a6b8c (closes #2372) test: all 4 event subscribers (dispatch, notify, review, unblock) have zero test coverage
- faa2970d test: regression tests for no_code_reroutes counter accumulation (#2392)
Refactors
- 982ccc3c refactor: extract shared merge-conflict helper in review_poll.rs (~120 lines duplicated)
- 973339af deps: replace unmaintained serde_yml with serde_norway
Observability
- c0ac2932 (closes #2358) fix(ndjson): render Codex turn.failed events in orch stream
What Failed and Why
claude/sonnet timeouts (5 in 24h, persistent from yesterday)
5 new timeout outcomes for claude/sonnet, distinct from failed. This is the same count as yesterday despite a ~30% increase in total task runs. If timeouts scaled proportionally, we'd expect 6 to 7 (5 × 1.3 = 6.5). The count is holding flat, which is mildly encouraging, but the pattern hasn't been root-caused.
The morning review flagged this for investigation — it wasn't addressed. Need to determine whether these are silence-detection timeouts (vs hard timeouts or something else), and whether the count is noise or a real ceiling on claude/sonnet latency.
Fire-and-forget pattern is pervasive
41 sites of store_set failures being silently discarded: this was the biggest cleanup of the day. These were real bugs hiding behind fire-and-forget semantics. The pattern is now documented as an anti-pattern and fixed across the codebase, but it highlights how far the codebase had drifted from proper error handling.
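The fix for this pattern is to route each write result through something that looks at it. A synchronous sketch of what a shared observe helper could look like, using only the standard library. The signature, the failure counter, and the store_set stand-in are all assumptions; the actual observe() in the codebase may differ, and the real call sites run inside tokio tasks.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical counter so discarded writes show up somewhere measurable.
pub static STORE_WRITE_FAILURES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical observe helper: consume a write result and surface a failure
/// instead of dropping it. Returns true when the write succeeded.
pub fn observe<T, E: std::fmt::Display>(context: &str, result: Result<T, E>) -> bool {
    match result {
        Ok(_) => true,
        Err(err) => {
            STORE_WRITE_FAILURES.fetch_add(1, Ordering::Relaxed);
            eprintln!("store write failed ({}): {}", context, err);
            false
        }
    }
}

/// Stand-in for the store write; fails for an empty key (illustration only).
pub fn store_set(key: &str, _value: &str) -> Result<(), String> {
    if key.is_empty() {
        Err("empty key".to_string())
    } else {
        Ok(())
    }
}
```

A call site then reads observe("push_failures reset", store_set(key, "0")), and the caller can decide what to do when it returns false rather than never learning that the write failed.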
Routing Accuracy
Overall: excellent. 85% success rate (420/494) is the highest since the sprint started. Key improvements visible:
- opencode silence detection (#2317): opencode now successfully runs 25+ sessions with minimax-m2.5-free. The fix is working.
- Audit trail: task_runs now correctly records agent, error, and attempts for all runs.
- no_code_reroutes infinite loop: tasks without PRs now correctly hit the block threshold instead of retrying forever.
- Store error handling: 41 DB failure sites now handle errors properly — tasks won't silently skip review or lose state.
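The no_code_reroutes fix is easiest to see as a counter that only accumulates if a re-route leaves it alone. A small sketch under stated assumptions: the ReviewDecision variants other than Rerouted, the function names, and the threshold parameter are hypothetical, and the real counter lives in the store rather than in a local variable.

```rust
/// Hypothetical subset of review outcomes; only `Rerouted` is named in this report.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReviewDecision {
    Approved,
    Rerouted,
}

/// The "don't reset" list: decisions that must leave the counter untouched.
pub fn resets_counter(decision: ReviewDecision) -> bool {
    !matches!(decision, ReviewDecision::Rerouted)
}

/// Counter value after a decision has been applied.
pub fn apply_decision(count: u32, decision: ReviewDecision) -> u32 {
    if resets_counter(decision) { 0 } else { count }
}

/// Number of no-code re-routes a task goes through before it is blocked.
pub fn reroutes_until_blocked(threshold: u32) -> u32 {
    let mut count = 0;
    let mut reroutes = 0;
    while count < threshold {
        count += 1; // one more re-route with no PR and no commits
        reroutes += 1;
        count = apply_decision(count, ReviewDecision::Rerouted);
    }
    reroutes
}
```

If Rerouted were treated as a resetting decision, the loop above would never reach the threshold, which is the infinite retry described in #2392.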
Agent health (24h, 494 runs)
| Agent | Model | Success | Failed | Timeout | Rate Limit | Total | Rate |
|---|---|---|---|---|---|---|---|
| claude | sonnet | 137 | 12 | 5 | 3 | 157 | 87% |
| kimi | opus | 79 | 5 | 2 | 3 | 89 | 89% |
| codex | gpt-5.3-codex | 70 | 4 | 0 | 0 | 74 | 95% |
| minimax | opus | 54 | 0 | 3 | 2 | 59 | 92% |
| opencode | minimax-m2.5-free | 25 | 0 | 0 | 0 | 25 | 100% |
| opencode | gpt-5-mini | 14 | 0 | 0 | 0 | 14 | 100% |
| claude | opus | 16 | 1 | 0 | 0 | 17 | 94% |
| opencode | nemotron-3-super-free | 10 | 5 | 0 | 0 | 15 | 67% |
Notes:
- opencode/nemotron is the weakest model (5 failures, 67% success). Worth watching — may need cooldown.
- opencode overall is now healthy with minimax-m2.5-free and gpt-5-mini (100% success).
- claude/sonnet has the most volume and the most failures (12 failed, 5 timed out: 17/157, about 11% failure/timeout rate).
- codex is performing well at 95% despite being the third-highest volume.
- kimi/opus rate limits are still low (3/89 = 3.4%); backoff is handling them correctly.
- olm/gemma4 is still absent from routing entirely.
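The Rate column can be recomputed from the outcome counts as a sanity check. A short sketch, assuming the rate is successes divided by the row total, rounded to the nearest whole percent (the AgentRow type is hypothetical and not part of the codebase):

```rust
/// Outcome counts for one agent/model row of the health table.
#[derive(Debug, Clone, Copy)]
pub struct AgentRow {
    pub success: u32,
    pub failed: u32,
    pub timeout: u32,
    pub rate_limit: u32,
}

impl AgentRow {
    /// Total runs: the sum of all four outcome columns.
    pub fn total(&self) -> u32 {
        self.success + self.failed + self.timeout + self.rate_limit
    }

    /// Success rate as a whole percentage, rounded to nearest; 0 for an empty row.
    pub fn success_pct(&self) -> u32 {
        let total = self.total();
        if total == 0 {
            return 0;
        }
        ((self.success as f64 / total as f64) * 100.0).round() as u32
    }
}
```

For the claude/sonnet row (137, 12, 5, 3) this gives a total of 157 and a rate of 87, matching the table.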
Open Issues
None. All 20 bugs filed today are closed. This is the first zero-open-issue retrospective since the sprint started.
Priorities for Tomorrow
- Root-cause claude/sonnet timeouts: 5 timeouts per 157 runs (3.2%) may be silence detection misfires or a legitimate latency ceiling. Look at duration_seconds on these task_runs records to determine whether they hit a hard timeout, a grace period, or something else.
- Upgrade CLI/service: now in its 3rd day as an unconfirmed issue. The service is on 0.61.x, CLI on 0.60.x. At this point the CLI upgrade is overdue. Check during morning review: brew upgrade orch && brew services restart orch && orch version
- Investigate opencode/nemotron failures (5): 67% success rate for this model is the lowest across all agents. Either apply a cooldown or investigate why it's failing disproportionately.
- Verify audit trail fixes in production: the 3 task_runs bugs (#2394, #2393, #2392) all landed with tests, but the real validation is in the production DB records. After the morning review, spot-check task_runs to confirm agents/errors/attempts are all correct.
- olm/gemma4 status: still absent from routing after 2 days. Either confirm it's intentionally excluded from the agent config, or investigate why it's not appearing.
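For the timeout root-cause work, one way to read duration_seconds is to bucket each record against the configured limit. A sketch only: the TimeoutKind names, the function, and the idea of a separate grace window are assumptions, and the limit and grace values have to come from the real configuration.

```rust
/// How a timed-out run's duration relates to the configured limit.
#[derive(Debug, PartialEq)]
pub enum TimeoutKind {
    /// Ended before the limit: points at silence detection or something else.
    BeforeLimit,
    /// Ended between the limit and limit + grace: consistent with a hard timeout.
    HardTimeout,
    /// Ran past limit + grace: the kill itself was late.
    PastGrace,
}

/// Classify one timeout record. All inputs are in seconds.
pub fn classify_timeout(
    duration_seconds: u64,
    limit_seconds: u64,
    grace_seconds: u64,
) -> TimeoutKind {
    if duration_seconds < limit_seconds {
        TimeoutKind::BeforeLimit
    } else if duration_seconds <= limit_seconds.saturating_add(grace_seconds) {
        TimeoutKind::HardTimeout
    } else {
        TimeoutKind::PastGrace
    }
}
```

If most of the 5 records fall in BeforeLimit, silence detection is the likelier cause. If they cluster in HardTimeout, that points at a latency ceiling.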