Evening Retrospective -- 2026-04-03
Summary
Exceptional day — 18 commits landed, 20 issues closed. Major reliability wave across routing, review, channels, and observability. Agent success rate improved to 86% (179/209 runs), up from 83% this morning. Version sync resolved: CLI and service both at 0.58.2 ✓. SQLite OOB panics remain the sole critical unresolved issue.
What Was Accomplished
Commits Landed (18 total, last 12h)
| SHA | Summary |
|---|---|
| d29fb92 | test: add opencode to integration_review.rs test coverage (#1731) |
| 4aa56a1 | bug: review evaluates wrong diff/summary after reroutes or rebases (#1724) |
| 6341ee9 | fix: strengthen review agent self-fix for auto-fixable CI failures (#1729) |
| 2dc29bc | fix: add topic_id to channel subscriptions for topic-aware delivery (#1726) |
| 3ccceb6 | docs: align control session prompt with supported commands (#1725) |
| 922d80b | bug: race condition in capture service tick loop (#1718) |
| 0a21c53 | warn: log warning on /tmp/.orch fallback in TaskRunner::new (#1714) |
| 12746a0 | fix(engine): warn on invalid config values instead of silent failure (#1715) |
| 08b578d | feat: job completion notifications via Telegram (#1638 / #1713) |
| cac547b | fix: change to COLLABORATOR in gh association |
| 69044c8 | check: only owner/contributor gh issues and comments sync (#1711) |
| b5a5bca | bug: Router::from_config() on async executor in error fallback (#1707) |
| e7ff044 | bug: session continuation flag silently dropped on shell format change (#1706) |
| b979f29 | fix: prefer skipped uncooled agent over cooled fallback in round-robin (#1705) |
| 45bd96e | refactor: deduplicate free-model retry logic (3 copies → 1 helper) (#1702) |
| c54d9f8 | fix: suppress expected tmux kill-session warnings in logs (#1701) |
| 448fb3c | feat: allow reroute/retry with specific agent and model (#1699) |
Issues Closed (20)
Key closures: #1728 (review CI exhaustion), #1722 (wrong review diff), #1721 (topic context dropped in channels), #1720 (stale control docs), #1716 (capture race), #1715 (silent config errors), #1714 (/tmp fallback silent), #1711 (non-contributor sync), #1707 (async executor in router), #1706 (session flag dropped), #1705 (cooled agent fallback), #1702 (retry dedup), #1701 (log noise), #1700 (oblivion router stuck), #1698 (reroute control), #1693 (token budget skip), #1692 (#1691 panics addressed), #1690 (token leak in logs).
Morning Priorities — Resolution
| Priority | Status |
|---|---|
| SQLite OOB panics (9 active) | NOT resolved — still actively occurring (see below) |
| brew upgrade orch (version sync) | Resolved: 0.58.2 in sync ✓ |
| #1665 dispatch (channel transport collision) | Resolved — fixed via #1706, #1711, #1721 family of PRs |
| opencode:opus model assignment | Partially resolved — #1705/#1706 addressed routing; 0 opencode:opus failures today |
| internal:34220 (blocked code quality) | Not resolved — still blocked at max review cycles |
Operational Health
Agent Performance (last 12h, 209 runs)
| Agent | Model | Success | Failed | Rate | Notes |
|---|---|---|---|---|---|
| minimax | opus | 47 | 0 | 100% | Best performer |
| opencode | qwen3.6-plus-free | 25 | 2 | 93% | Reliable free tier |
| codex | gpt-5.2-codex | 22 | 2 | 92% | Solid |
| opencode | copilot/gpt-5-mini | 21 | 0 | 100% | Primary fallback |
| claude | sonnet | 12 | 5 | 71% | 5 failures — see note |
| codex | gpt-5.3-codex | 10 | 0 | 100% | Clean |
| kimi | opus | 8 | 4 | 67% | 4 failures, billing resets |
| opencode | copilot/gpt-5.4 | 7 | 1 | 88% | Minor |
| opencode | copilot/claude-opus-4.6 | 6 | 0 | 100% | Stable |
| opencode | nemotron-3-super-free | 5 | 0+2 push_failed | 83% | Push scope issue |
| opencode | minimax-m2.5-free | 4 | 1 | 80% | Fine |
Overall: 86% (179 success / 209 runs). Improvement from 83% this morning.
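
The overall figure is consistent with successes divided by total runs, rounded to the nearest whole percent. A minimal sketch of that arithmetic follows; the function name `success_rate_pct` is hypothetical, and how `push_failed` runs are counted in the per-agent rows is not stated in this report.

```rust
/// Success rate as a whole-number percentage, rounded to the nearest integer.
/// Returns `None` when there were no runs, to avoid dividing by zero.
/// (Hypothetical helper for illustration; not taken from the orch codebase.)
pub fn success_rate_pct(successes: u32, total_runs: u32) -> Option<u32> {
    if total_runs == 0 {
        return None;
    }
    let ratio = successes as f64 / total_runs as f64;
    Some((ratio * 100.0).round() as u32)
}
```

With the day's totals, `success_rate_pct(179, 209)` yields `Some(86)`.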
opencode with no model (5 failed, 4 success): Empty model string in 9 runs — likely a routing edge case where model resolution fails silently. The 4 successes suggest the agent infers a default model in some configs. Root cause unclear.
claude/sonnet 5 failures + 2 rate_limits: Credits still running thin. out_of_credits and org_level_disabled events in the last 12h. Cooldown system handling correctly; other agents absorbing load.
kimi 4 failures: Billing cycle resets. The 24h cooldown is still too short for monthly billing cycles, but kimi's failure rate (67%) is improving vs prior days.
Push Failures (3)
- opencode/nemotron-3-super-free: 2 failures in the `bean` repo. The OAuth App token lacks the `workflow` scope, which blocks pushes of `.github/workflows/` files. Persistent issue (#1681 class of bug).
- 1 failure: `git` not found in PATH during push (os error 2; transient environment issue).
Task Status
| Status | Count |
|---|---|
| done | 758 |
| in_progress | 6 |
| needs_review | 1 |
| blocked | 2 |
Active work: internal:40365 (code quality), internal:40366 (code dev improvements), internal:40367 (this retro), #1732/#1733 (token budget bugs), #1734 (budget tracking bug).
Failures and Root Causes
1. SQLite OOB Panics — STILL UNRESOLVED (Critical)
thread 'sqlx-sqlite-worker-N' panicked at sqlx-sqlite-0.8.0/src/row.rs:43:46:
index out of bounds: the len is 56 but the index is 56
All 10 SQLx workers (0–9) are panicking. Error log mtime confirms this is current (Apr 4 01:52 UTC). The service survives the panics through worker pool auto-recovery, but they indicate a column index mismatch in a SQL query: likely a `SELECT *` against a table that has been extended by a migration, while an old struct maps fewer columns than the table now has.
This is the #1 operational risk. When all 10 workers panic simultaneously, any concurrent query will fail until workers restart. This could cause task dispatch gaps, lost metric writes, or missed review dispatches.
Related to #1623 (panics in production paths), but the root cause is different — this is in the SQLx row deserialization layer, not in unwrap()/expect() calls.
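
To make the suspected failure mode concrete, here is a minimal std-only sketch. It does not model sqlx internals; `Row`, `get_by_index`, and `get_by_name` are hypothetical names. It shows two ideas that may help once the offending query is found: positional access that returns an error instead of panicking when the index equals the length, and name-based lookup that is unaffected when a migration appends columns.

```rust
/// Hypothetical stand-in for a database row: column names plus values.
#[derive(Debug)]
pub struct Row {
    pub columns: Vec<String>,
    pub values: Vec<String>,
}

/// Bounds-checked positional access. Returns an error instead of panicking
/// when `index >= len`, which is the condition shown in the panic message.
pub fn get_by_index(row: &Row, index: usize) -> Result<&str, String> {
    row.values
        .get(index)
        .map(|v| v.as_str())
        .ok_or_else(|| {
            format!("column index {} out of range (len {})", index, row.values.len())
        })
}

/// Name-based access. The lookup does not depend on column position, so
/// columns appended by a migration do not shift what is read.
pub fn get_by_name<'a>(row: &'a Row, name: &str) -> Result<&'a str, String> {
    let pos = row
        .columns
        .iter()
        .position(|c| c == name)
        .ok_or_else(|| format!("no column named {}", name))?;
    get_by_index(row, pos)
}
```

If the root cause does turn out to be a `SELECT *`, listing columns explicitly in the query would be the analogous change on the SQL side.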
2. Stale Worktree Metadata (Low Severity)
fatal: not a git repository: /Users/gb/Projects/orch/.git/worktrees/gh-issue-1005-*
fatal: not a git repository: /Users/gb/Projects/orch/.git/worktrees/gh-issue-1130-*
fatal: not a git repository: /Users/gb/Projects/orch/.git/worktrees/gh-issue-1251-*
Stale `.git/worktrees/` metadata entries remain for deleted worktrees (issues 1005, 1130, 1251). Git complains on each startup reconciliation. Benign but noisy; can be cleaned with `git worktree prune`.
3. Token Budget Issues (#1733, #1734 — New)
Two new bugs filed today around token budget enforcement:
- #1733: Pre-run gate sends over-budget tasks into review instead of blocking them
- #1734:
task_runsrecords budget-exceeded runs assuccess, hiding them from analysis
These explain the persistent budget overruns flagged in prior days (2.5M token tasks). The detection works but the enforcement path is wrong.
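
The intended behaviour described by the two bugs can be sketched as follows. This is an illustration under assumptions, not the orch implementation: `RunStatus`, `GateDecision`, `pre_run_gate`, and `status_to_record` are hypothetical names, and the real `task_runs` schema and budget thresholds may differ.

```rust
/// Hypothetical outcome labels for a run; the real `task_runs` values may differ.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum RunStatus {
    Success,
    Failed,
    BudgetExceeded,
}

/// Hypothetical decision returned by the pre-run gate.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum GateDecision {
    Dispatch,
    Block,
}

/// Pre-run gate (#1733): a task that has already used up its budget is
/// blocked rather than forwarded to review.
pub fn pre_run_gate(tokens_used: u64, budget: u64) -> GateDecision {
    if tokens_used >= budget {
        GateDecision::Block
    } else {
        GateDecision::Dispatch
    }
}

/// Status to record for a finished run (#1734): a budget overrun takes
/// precedence over the agent's own result, so it stays visible in analysis.
pub fn status_to_record(agent_succeeded: bool, tokens_used: u64, budget: u64) -> RunStatus {
    if tokens_used > budget {
        RunStatus::BudgetExceeded
    } else if agent_succeeded {
        RunStatus::Success
    } else {
        RunStatus::Failed
    }
}
```

Keeping the overrun as its own status, rather than folding it into `Failed`, is one way to let later queries separate budget problems from agent errors.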
4. Router LLM Response Parsing (#1732 — New)
The LLM router (haiku) is producing 22KB responses that mix hook event JSON with the structured routing output. When the router exits non-zero, the parser discards its structured stdout, so the error diagnosis is lost. A bug has been filed to fix this.
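
One possible shape for a more tolerant parse path is sketched below. The output format is an assumption made for illustration only: the sketch treats stdout as line-oriented, treats any JSON-object-shaped line containing a `"hook_event"` key as hook noise, and takes the last remaining JSON-object-shaped line as the routing output. `RouterResult` and `parse_router_output` are hypothetical names.

```rust
/// Result of one router invocation. All names here are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub struct RouterResult {
    pub exit_code: i32,
    /// Candidate routing line, retained even when the exit code is non-zero.
    pub routing_line: Option<String>,
    /// Number of lines skipped as hook noise, kept for diagnostics.
    pub skipped_hook_lines: usize,
}

/// Scans stdout line by line and separates hook events from routing output.
/// The exit code is recorded but never causes the parsed output to be dropped.
pub fn parse_router_output(exit_code: i32, stdout: &str) -> RouterResult {
    let mut routing_line = None;
    let mut skipped_hook_lines = 0;
    for line in stdout.lines().map(str::trim) {
        // Ignore anything that is not shaped like a single JSON object.
        if !(line.starts_with('{') && line.ends_with('}')) {
            continue;
        }
        if line.contains("\"hook_event\"") {
            skipped_hook_lines += 1;
        } else {
            // Later candidates overwrite earlier ones: last one wins.
            routing_line = Some(line.to_string());
        }
    }
    RouterResult { exit_code, routing_line, skipped_hook_lines }
}
```

The string matching here is deliberately crude; a real fix would presumably parse the JSON properly. The point of the sketch is only that exit status and structured output are returned together, so a caller can log the diagnosis on failure.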
Routing Analysis
Routing worked well today. No oblivion-class failures (last one was #1700, closed). The #1705 fix (prefer uncooled agents over cooled fallbacks in round-robin) is working — opencode:opus disappeared from failure reports entirely.
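
The selection rule behind the #1705 fix can be sketched roughly as below. This is a simplified illustration, not the actual router code: `Agent` and `pick_agent` are hypothetical names, and `cooled` is assumed to mean that the agent is currently in cooldown.

```rust
/// Hypothetical agent record; `cooled` is assumed to mean "in cooldown".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Agent {
    pub name: String,
    pub cooled: bool,
}

/// Round-robin pick starting at `cursor`. Scans the whole ring once and
/// returns the index of the first agent that is not in cooldown. Only when
/// every agent is cooled does it fall back to the agent at the cursor.
/// Returns `None` for an empty list.
pub fn pick_agent(agents: &[Agent], cursor: usize) -> Option<usize> {
    if agents.is_empty() {
        return None;
    }
    let n = agents.len();
    let start = cursor % n;
    (0..n)
        .map(|offset| (start + offset) % n)
        .find(|&i| !agents[i].cooled)
        .or(Some(start))
}
```

The key property is that a cooled agent is chosen only as a last resort, after every uncooled candidate in the ring has been considered.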
Issue #1723 (review re-dispatch assigns incompatible model) has a PR open (#1730, needs_review). This was causing review agents to get opencode:opus which then fails. Once merged, review dispatch should be cleaner.
Router LLM producing very long responses: The haiku router is capturing full SDK hook output (22826 chars) alongside the routing JSON. This is likely a shell output capture issue — the router invocation is capturing hook payloads meant for the SDK transport layer. Not yet a routing failure but indicates the router parse path needs to be more robust (related to #1732).
Open Issues (5 active)
| # | Title | Priority |
|---|---|---|
| #1734 | bug: task_runs records budget-exceeded runs as success | High — hides failures |
| #1733 | bug: pre-run token budget gate sends over-budget into review | High — wrong enforcement |
| #1732 | bug: router LLM non-zero exits discard structured stdout | Medium — diagnosis lost |
| #1723 | bug: review re-dispatch assigns incompatible model | High — PR #1730 in review |
| #1623 | Panics from unwrap()/expect() in production paths | High — related to OOB panics |
Blocked Tasks
| Task | PR | Agent | Cycles | Notes |
|---|---|---|---|---|
| internal:34220 | #1660 | opencode/opus | max | Code quality review — max cycles exceeded |
| #1623 | #1626 | opencode/opus | max | Panics fix — max cycles exceeded |
Both blocked at max review cycles. Both have PRs. May need human review or cycle reset.
Tomorrow's Priorities
- SQLite OOB panics: Investigate `row.rs:43:46` and find the query with a column index mismatch (likely a `tasks` table column added in a migration without updating the corresponding Rust struct). This is the #1 risk: all 10 workers panicking simultaneously means brief windows where all DB queries fail.
- Merge #1730 (review re-dispatch model fix): currently in `needs_review`. Once merged, it should eliminate the incompatible model issue during review cycles.
- Token budget enforcement fixes (#1733, #1734): Agents are processing these now. Watch for PRs and review quickly. The budget enforcement path is broken end-to-end.
- Stale worktree metadata: Run `git worktree prune` in `/Users/gb/Projects/orch` to clean entries for issues 1005, 1130, 1251. This is a 30-second cleanup that reduces log noise permanently.
- Review internal:34220 and #1623: Both blocked at max cycles. Check if PRs are close to mergeable and either reset cycles or close the tasks.
- OAuth `workflow` scope: `bean` repo tasks keep failing to push `.github/workflows/` files. Either add the scope to the token or configure agents not to create workflow files in that repo.