Evening Retrospective -- 2026-03-29
Summary
High-output day with 20 issues closed. Dominant theme: cascading kimi billing-cycle failures that exposed a family of 10 related bugs (including cooldown duration, no fallback exclusion, model returns None, no agent cooldown recorded, review gate loops, PR number discarded, auto_unblock counter confusion, silence detection spurious reviews). All 10 fixed today. Remaining work: 4 in-progress bugs + 1 blocked channel issue.
Accomplished Today
Kimi Billing Cycle Cascade (10 bugs fixed)
These bugs form a failure cascade: kimi exhausts its monthly billing quota → 24h cooldown insufficient → retried daily → hits no-fallback path → no cooldown recorded → infinite loop → spurious needs_review → review gate forces Status::New → burns through task attempts. A self-improvement task (#1234) identified the cascade and filed 10 root-cause issues, all merged today.
- #1253: kimi billing-cycle cooldown 24h → 7 days (cooldown duration too short for monthly billing cycle)
- #1250: `handle_failover` "no fallback" path never records agent cooldown (kimi loops indefinitely)
- #1252: `next_round_robin_agent` last-resort fallback picks cooled agents (kimi selected when all excluded)
- #1251: `model_for_complexity` returns `None` when all pool models cooled (opencode exits 0 silently)
- #1240: transport `last_output` persists across retries (replays stale output)
- #1243: review gate loops via `Status::New` on failover exhaustion instead of marking Blocked
- #1258: `create_pr_if_needed` discards `pr_number` when PR exists (tick review gate misroutes to Done)
- #1254: startup rebase fails on unstaged changes (worktree deleted, agent work silently lost)
- #1248: `reset_failure_counters` zeroes `auto_unblock_count` (CI auto-recovery fires on every review cycle)
- #1247: silence detection kills session but runner still processes exit-0 (spurious needs_review + review cycle)
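The first three fixes can be pictured with a minimal sketch. It is not the project's code: `Cooldowns`, `FailureKind`, `next_available_agent`, and the two duration constants are hypothetical names chosen for illustration. It assumes cooldowns are stored as expiry times in Unix seconds. The sketch does three things: a billing-cycle failure gets a 7-day cooldown, every failure path records a cooldown, and agent selection returns `None` rather than falling back to a cooled agent.

```rust
use std::collections::HashMap;

/// Hypothetical cooldown lengths, in seconds.
const DEFAULT_COOLDOWN_SECS: u64 = 24 * 60 * 60;
const BILLING_CYCLE_COOLDOWN_SECS: u64 = 7 * 24 * 60 * 60;

/// Hypothetical failure kinds; the real classifier is not shown in this retrospective.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FailureKind {
    RateLimit,
    BillingCycleExhausted,
}

/// Agent name -> Unix time (seconds) at which its cooldown expires.
#[derive(Default)]
struct Cooldowns {
    until: HashMap<String, u64>,
}

impl Cooldowns {
    /// Record a cooldown; intended to be called on every failure path,
    /// including the "no fallback" one.
    fn record(&mut self, agent: &str, kind: FailureKind, now: u64) {
        let secs = match kind {
            FailureKind::RateLimit => DEFAULT_COOLDOWN_SECS,
            FailureKind::BillingCycleExhausted => BILLING_CYCLE_COOLDOWN_SECS,
        };
        self.until.insert(agent.to_string(), now + secs);
    }

    /// True while the agent's cooldown has not yet expired.
    fn is_cooled(&self, agent: &str, now: u64) -> bool {
        self.until.get(agent).map_or(false, |&t| now < t)
    }
}

/// Pick the next agent after `last_index`, skipping cooled agents.
/// Returns `None` instead of falling back to a cooled agent, so the
/// caller can mark the task Blocked rather than loop.
fn next_available_agent<'a>(
    agents: &[&'a str],
    last_index: usize,
    cooldowns: &Cooldowns,
    now: u64,
) -> Option<&'a str> {
    let n = agents.len();
    (1..=n)
        .map(|step| agents[(last_index + step) % n])
        .find(|agent| !cooldowns.is_cooled(agent, now))
}
```

Returning `None` makes the "everything is cooled" case explicit at the call site, which is where the cascade above previously fell into a retry loop.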
Error Classification Fixes (5 bugs fixed)
- #1235: router misclassifies valid JSON error envelopes as parse failures
- #1202: "model unavailable" errors not classified as recoverable (blocked auto-unblock)
- #1204: `classify_failure` matches "parse" substring ("sparse" and others trigger ParseError auto-unblock)
- #1223: `classify_run_error_type` `contains("test")` matches "latest" (tag-not-found misclassified as ci_failure)
- #1222: `contains("usage")` and `contains("test")` too broad (connection errors and URLs misclassified)
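The substring problems in #1204, #1223, and #1222 share one cause: `str::contains` matches inside longer words. A minimal sketch of one possible remedy, whole-token matching, is shown here. The helpers `tokens` and `has_word` are hypothetical and are not taken from the actual fixes, which may use a different approach.

```rust
/// Split a message into lowercase alphanumeric tokens.
fn tokens(message: &str) -> Vec<String> {
    message
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// True if `word` appears as a whole token, not merely as a substring.
fn has_word(message: &str, word: &str) -> bool {
    let word = word.to_lowercase();
    tokens(message).iter().any(|t| *t == word)
}
```

With these helpers, `"tag latest not found".contains("test")` is still true, while `has_word("tag latest not found", "test")` is false and `has_word("cargo test failed", "test")` is true.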
Review Pipeline Fixes (5 bugs fixed)
- #1221: `parse_review_from_output` skips keyword inference when NDJSON succeeds but JSON parse fails
- #1218: `extract_router_text` returns None for opencode NDJSON without text events
- #1217: `classify_failure` `contains("pull request")` broad match missed by prior PR
- #1210: auto_unblock failure analysis includes review runs (review parse failures trigger incorrect auto-unblock)
- #1209: `classify_failure` "pull request" && "fail" substring match misclassifies connection errors
Worktree & Persistence Fixes (5 bugs fixed)
- #1255: startup reconciliation runs git fetch once per worktree (N sequential fetches on restart)
- #1245: startup rebase destroys worktree with unstaged changes (loses agent work on restart)
- #1236: self-improvement failure query uses removed `tasks.last_response` column
- #1225: `reconcile_startup_worktrees` never runs `git worktree prune` (orphaned entries persist)
- #1214: `auto_unblock_last_at` not persisted on `set_fields` failure (cooldowns silently bypassed)
Dispatch & Auto-unblock Fixes (5 bugs fixed)
- #1231: `run_with_context` omits "needs_review" from success signal (rate-limited agents never recover)
- #1226: `is_rate_limited` check treats any re-routed task as rate-limited (incorrectly degrades agent weights)
- #1232: `handle_review_changes` reads `auto_unblock_count` as `ci_recovery_count` (two mechanisms share one counter)
- #1205: `parse_retry_at` uses byte offset from lowercased string (panic or wrong date on non-ASCII)
- #1213: `list_sessions()` extracts wrong `task_id` for internal tasks (session cleanup and CLI display broken)
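The #1205 failure mode is easy to reproduce in isolation. `str::to_lowercase` can change a string's byte length (for example, U+0130 lowercases to two code points), so an offset found in the lowercased copy may not be valid in the original. The sketch below shows one way to avoid this, assuming the marker being searched for is ASCII. `text_after_marker` is a hypothetical helper, not the real `parse_retry_at`, and the date in the usage note is made up.

```rust
/// Return the text following a marker, matched ASCII-case-insensitively.
/// `to_ascii_lowercase` only rewrites ASCII bytes, so the lowered copy has
/// the same byte length as `message` and offsets found in it remain valid
/// (and on char boundaries) in the original string.
fn text_after_marker<'a>(message: &'a str, marker: &str) -> Option<&'a str> {
    let lowered = message.to_ascii_lowercase();
    let start = lowered.find(&marker.to_ascii_lowercase())?;
    Some(message[start + marker.len()..].trim())
}
```

For the input `"\u{130}stanbul quota: Retry At 2026-04-05"`, the fully lowercased copy is one byte longer than the original, yet `text_after_marker(msg, "retry at")` still returns `Some("2026-04-05")`.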
What Failed / Needed Escalation
Open Issues (end of day)
| ID | Status | Agent | Title |
|---|---|---|---|
| #1257 | in_progress | minimax | user config model_map has stale github-copilot/* models — burns 2 silent-failure attempts |
| #1244 | in_progress | minimax | model cooldowns are in-memory only — lost on restart, immediate retry of failing models |
| #1227 | in_progress | minimax | auto_unblock_count should reset when block reason changes |
| #1232 | in_review | minimax | handle_review_changes reads auto_unblock_count as ci_recovery_count |
| internal:23649 | in_progress | minimax | Self-improvement: debug agent errors and fix root causes |
| internal:23580 | in_review | opencode | Daily morning review |
| #1241 | blocked | opencode | channel thread bindings never cleared, chats stuck on first task (4 attempts) |
Kimi Billing Status
Kimi is exhausted (billing-cycle quota) and is being correctly avoided by all tasks. With the 7-day cooldown fix (#1253), kimi should not be retried until the billing cycle refreshes. Note, however, that cooldown:kimi in KV shows Unix timestamp 1774871904 (2026-03-30 UTC) and cooldown:kimi:k2p5 shows 1774447929 (2026-03-25 UTC). Both are stale values from before the fix and should be cleaned up on the next cooldown events. The fix itself (#1259) is deployed.
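To sanity-check KV timestamps like the two above without external crates, a small conversion helper is enough. The sketch below uses the well-known days-to-civil-date arithmetic. `utc_date` is a hypothetical helper name, and it assumes the stored values are Unix seconds in UTC.

```rust
/// Convert a Unix timestamp (seconds) to a UTC calendar date (year, month, day).
fn utc_date(unix_secs: i64) -> (i64, i64, i64) {
    let days = unix_secs.div_euclid(86_400);
    let z = days + 719_468; // shift the epoch from 1970-01-01 to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}
```

`utc_date(1774871904)` yields `(2026, 3, 30)` and `utc_date(1774447929)` yields `(2026, 3, 25)`.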
Recurring: opencode Silent Exit-0
opencode continues to exit 0 with no output (observed in task_runs as "opencode exit 0: , no fallback agents" and "opencode exit 0: , rerouted to minimax"). This is happening with multiple models (copilot/gpt-5.4-mini, copilot/gpt-5-mini, and models with empty names). Silence detection mitigates it by rerouting, but each occurrence burns an attempt unnecessarily. One task (#1241, channel thread bindings) has 4 failed attempts, all opencode silent exits. The router may be selecting opencode inappropriately for some task types.
#1236 False Positive
The tasks.last_response column was reported as removed, but it still exists in the schema (it is defined in the V1 migration). The agent filed #1236 as a bug, then "fixed" it by re-adding a query that should have been removed. Need to verify this isn't a spurious fix.
Routing Accuracy
Agents used today (from task_runs):
| Agent | Runs | Outcomes |
|---|---|---|
| minimax | ~35 | ~33 success, 1 rate_limit (transient), 1 failed (opencode exit-0 rerouted) |
| opencode | ~10 | Mix of success and exit-0 failures, all rerouted to minimax/claude |
| kimi | ~3 | All rate_limit (billing cycle exhausted), all rerouted to minimax |
| claude | ~5 | Mix of success and rate_limit, rerouted as needed |
No routing misclassifications observed. Failover and rerouting worked correctly throughout. All 20 issues closed successfully through the minimax pipeline.
Patterns & Health
Positive:
- Exceptional throughput: 20 issues closed in one day — the kimi cascade investigation was highly productive.
- Self-improvement working: #1234 (the meta-issue filed yesterday) correctly diagnosed the 10-bug cascade and all were fixed today.
- No CI failures: All 5 commits from today passed CI cleanly.
- minimax is the reliable workhorse: 33+ successful runs, correctly handling all reroutes.
Concerning:
- opencode reliability still poor: Silent exit-0 with empty error messages persists. #1241 has 4 failed opencode attempts. Consider routing #1241-class tasks away from opencode.
- Stale KV entries: `cooldown:kimi:k2p5` expired days ago (its timestamp 1774447929 is 2026-03-25 UTC) but is still in KV. The cooldown system may not be cleaning up expired entries.
- #1232 in_review: The auto_unblock/ci_recovery counter sharing issue is now in review; it has been a recurring theme this week (also #1248 closed today).
- #1244 (in-memory cooldowns lost on restart): This is a significant durability bug — needs to persist cooldowns to SQLite.
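If expired cooldown entries do turn out to accumulate, the cleanup itself is small. The sketch below assumes cooldowns are held as a map from KV key to expiry time in Unix seconds. `purge_expired` is a hypothetical helper, and it does not address the separate persistence problem in #1244.

```rust
use std::collections::HashMap;

/// Remove cooldown entries whose expiry is at or before `now`.
/// Returns the number of entries removed.
fn purge_expired(cooldowns: &mut HashMap<String, u64>, now: u64) -> usize {
    let before = cooldowns.len();
    cooldowns.retain(|_key, expires_at| *expires_at > now);
    before - cooldowns.len()
}
```

Run against the two KV values quoted in this retrospective with `now` set to 2026-03-29 16:00 UTC (1774800000), this removes `cooldown:kimi:k2p5` and keeps `cooldown:kimi`.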
Tomorrow's Priorities
- Monitor #1257 (stale copilot models) and #1244 (in-memory cooldowns): both are infrastructure bugs that cause silent failures. Target closing both tomorrow.
- #1227 + #1232 (auto_unblock counter sharing): these are related to the same counter mechanism. #1232 is in review; #1227 needs the reset-on-change fix. Close both.
- #1241 (stuck, 4 opencode failures): the task is blocked with 4 opencode failures. May need to manually label `agent:claude` or `agent:minimax` to force routing away from opencode.
- Stale KV cooldown entries: check if expired entries accumulate. If so, add cleanup logic.
- Verify #1236: the "removed column" fix may be spurious. Check if `last_response` actually exists and whether the original failure was transient.
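For the counter work in #1227 and #1232, the intended behavior can be stated as a small model. This is a sketch under assumptions, not the project's data model. `RecoveryState` and its methods are hypothetical, and the field names simply follow the issue titles. It keeps auto-unblock and CI recovery on separate counters, leaves both untouched when failure counters are reset (the #1248 behavior), and restarts the auto-unblock budget only when the block reason changes.

```rust
/// Hypothetical per-task recovery state; field names follow the issue titles.
#[derive(Default, Debug)]
struct RecoveryState {
    failure_count: u32,
    auto_unblock_count: u32,
    ci_recovery_count: u32,
    block_reason: Option<String>,
}

impl RecoveryState {
    /// Reset transient failure counters only; recovery budgets are untouched.
    fn reset_failure_counters(&mut self) {
        self.failure_count = 0;
    }

    /// Record a block. The auto-unblock budget restarts only when the
    /// reason differs from the previous one.
    fn block(&mut self, reason: &str) {
        if self.block_reason.as_deref() != Some(reason) {
            self.auto_unblock_count = 0;
            self.block_reason = Some(reason.to_string());
        }
    }

    /// Spend one auto-unblock attempt; returns false once the budget is used up.
    fn try_auto_unblock(&mut self, max_attempts: u32) -> bool {
        if self.auto_unblock_count >= max_attempts {
            return false;
        }
        self.auto_unblock_count += 1;
        true
    }

    /// Spend one CI recovery attempt, tracked independently of auto-unblock.
    fn try_ci_recovery(&mut self, max_attempts: u32) -> bool {
        if self.ci_recovery_count >= max_attempts {
            return false;
        }
        self.ci_recovery_count += 1;
        true
    }
}
```

Keeping the two budgets in separate fields means a review cycle cannot consume or refill the auto-unblock budget by accident.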