Evening Retrospective — 2026-03-14
Summary
Extremely productive day — 23 commits merged across 10 PRs. The reliability initiative (unwrap/expect hardening across core modules) reached completion. Several important race condition fixes landed. The system is now meaningfully more robust than 24 hours ago. However, a significant efficiency problem emerged: the same two issues were created 8–9 times each by agents running concurrently, wasting API quota and agent cycles. This is the #1 item to address tomorrow.
Recent Changes (last 12 hours)
| Commit | Description |
|---|---|
855a44e | fix: use SQLite status as source of truth in task list |
fc9a3a3 | fix: show all tasks in orch task list and skip link for internal tasks |
c817bbc | fix: ingest unlabeled GitHub issues into SQLite store |
148804b | fix: treat zero CI check-runs as success in get_combined_status |
2ae3202 | fix: prevent auto-merge race condition with per-task dispatching lock |
62a7bbf | Code development: orch (#622) |
e29e830 | fix: stop pulling main every 45s in cleanup tick, pull only when needed |
6f23b1c | docs: fix missing runner.sh and result.json in artifacts listing, remove duplicate PLAN entry |
c55a8c8 | docs: update all markdown to reflect current architecture (#621) |
d7e55d6 | fix: tell agents not to pollute summary with push failure messages |
28e1249 | docs: update AGENTS.md to recommend cargo nextest run (matches CI) |
5c16bc5 | fix: skip build-macos on chore/docs commits to save runner minutes |
6918397 | fix: mark tmux integration test as #[ignore] to fix CI |
9de4a8d | fix: break dispatch loop when work already merged + fix CI tmux test (#620) |
a14f1ee | fix: recover from poisoned mutex in tick, slack, telegram; remove fragile unwrap in codex error detection (#615) |
0c4f8a5 | ci: retrigger review-gate check (#617) |
83aea43 | Code review: orch (#581) |
0cae1cb | fix: recover from poisoned mutex in telegram, slack, and runner modules (#614) |
5c6a879 | Reliability: Audit & remove panicking unwrap/expect across core modules (http, tmux, sidecar, token, template, home) (#608) |
ee3bb02 | fix: recover from poisoned mutex in webhook status guards (#605) |
cb4f70a | fix: recover from poisoned mutex in ensure_watcher (#602) |
ea0730d | fix: replace risky unwrap/expect in cli, slack, and http core modules (#595) |
d1d03a0 | Daily morning review: improvements and optimization (#584) |
e96dba1 | fix: replace unwrap() in get_all_pages with safe pagination termination (#587) |
What Completed Today
Reliability initiative (PRs #602, #605, #608, #614, #615) — The unwrap/expect audit across core modules is complete. Five PRs merged, covering http, tmux, sidecar, token, template, home, cli, slack, webhook, tick. The engine can no longer panic on a poisoned mutex or missing config in any of these paths. This was the top priority from the morning review.
Race condition sweep — Three distinct races fixed:
- Auto-merge race: per-task dispatching lock prevents two ticks from simultaneously dispatching the same task (#2ae3202).
- Dispatch loop: when an agent's branch has no diff from main (already merged), the task now correctly falls through to
doneinstead of looping throughneeds_review → new(#620). - Cleanup tick: stopped unconditional
git pullon every 45s cleanup tick — only pulls when actually needed (#e29e830).
SQLite as source of truth — Task list now reads status from SQLite for internal tasks, and unlabeled GitHub issues are correctly ingested (#855a44e, #c817bbc). The CLI orch task list output is now accurate.
Docs — Comprehensive architecture update (#621) and AGENTS.md/PLAN.md corrections.
Morning review items (2026-03-14) — The morning review tracked three carry-over items from the 2026-03-13 retro and dispatched three tasks:
| Item | Plan | Outcome |
|---|---|---|
| Issue #582: pagination unwrap panic | In progress as of morning | ✓ Fixed — e96dba1 replaces unwrap in get_all_pages |
| Issue #583: reliability audit (unwrap/expect) | In progress as of morning | ✓ Completed — 5 PRs merged (#602, #605, #608, #614, #615) |
| "No open PR" race condition | Monitor for recurrence | ✓ No recurrence observed; dispatch loop fix (#620) reinforces this |
| Router timeout 120s → 60s | Low priority, one-line change | ✗ Still unaddressed |
The morning review explicitly flagged no new issues were needed and health was good. All substantive planned items were resolved. The router timeout remains the only carry-over that did not land.
Failures and Retries
Issue duplication explosion — The day's most significant failure mode. Two issues were each created 8–9 times:
- "BUG: GitHub HTTP pagination may panic when Link header missing" — issues #589, #593, #596, #599, #603, #606, #609, #612 (8 duplicates, 10:39–12:23 UTC)
- "Reliability: Audit & remove panicking unwrap/expect across core modules" — issues #590, #594, #597, #600, #604, #607, #610, #613, #619 (9 duplicates)
Root cause: when a code-development agent creates an issue and its PR merges quickly, the issue closes. The NEXT code-development agent dispatch checks has_open_issue_with_title → gets false → creates a new issue for the same bug. Agents running concurrently also all see "no open issue" and simultaneously file duplicates. The dedup logic in src/engine/jobs.rs only checks OPEN issues, not recently-closed ones.
PR #624 "Failed to link issue to branch PR" — An error message propagated into an issue title. An agent encountered a gh issue develop failure ("failed to link issue to branch"), and this error string became the issue title and PR title. Merged at 20:19 UTC. This suggests agent error handling in the code-development task is insufficient — errors during setup should abort rather than propagate into issue/PR creation.
Multiple closed PRs for same branch — PRs #601, #616 (closed without merge) and #598 for the same reliability audit work. These correspond to agent attempts that were superseded by later attempts. Expected behavior once the dispatch loop fix lands.
Agent Prompt Assessment
Code-development agent prompt needs strengthening. The prompt instructs agents to check git log --since 48h and open issues before creating new ones, but agents are not checking recently-closed issues. Adding "also check gh issue list --state closed --search '<title>' --since 24h" to the code-development prompt would prevent re-creating issues for bugs fixed today.
Error propagation in setup steps is insufficiently guarded. The PR #624 incident shows an agent treating a gh issue develop failure as a bug to file rather than an abort condition. The agent system prompt should explicitly state: "if GitHub setup commands fail (issue develop, branch creation), STOP immediately — do not file issues for infrastructure failures."
Routing prompts are working well. All tasks were correctly classified. No routing misfires observed.
Routing Accuracy
| Task | Routed To | Outcome |
|---|---|---|
internal:57 (opencode fix, existing worktree) | opencode | ✓ Correct — reused branch |
internal:58 (code development) | claude/sonnet | ✓ Correct — Rust fixes |
internal:993 (code development, #622) | claude/opus | ✓ Correct — complex reliability work |
internal:995 (code review, #623) | kimi | ✓ Correct — focused review |
internal:1004 (this retro) | claude/sonnet | ✓ Correct — analysis task |
Performance
- Throughput: 23 commits, 10 PRs merged in 12 hours. Highest daily throughput in recent history.
- CI: Green across all runs. nextest migration (#580) working well.
- API quota: Wasted on ~16 duplicate issues and their associated label/comment operations. Estimate 60–80 excess API calls.
- Cleanup tick: No longer pulls main on every 45s cycle — significant reduction in git network traffic.
Open Items
Issue dedup does not check recently-closed issues — The dedup guard in create_self_improvement_issue (jobs.rs:509) only calls has_open_issue_with_title. When an issue closes within the same day, the next agent dispatch creates a new one. Fix: extend has_open_issue_with_title to also check issues closed within the last 24h, OR add a cooldown table in SQLite per issue title.
Agent prompt gap: error propagation — Code-development agents should abort on infrastructure failures (branch creation, issue linking) rather than filing issues about the failure. One-line addition to agent_system.md.
Router timeout 120s → 60s — Still not addressed. Still a one-line change in src/engine/router/config.rs:24. Low priority.
Issues Filed
1 issue filed: duplicate issue creation when open-issue dedup misses recently-closed issues.
This is the root cause of the 8–9x duplication today. Filed as a targeted bug with the specific files and fix scoped.
Tomorrow's Priority
Fix issue dedup to include recently-closed issues (24h window) — Root cause of today's most wasteful failure pattern. The fix is in
src/engine/jobs.rs(extendhas_open_issue_with_title) and optionally the code-development agent prompt (add closed-issue check). One issue filed today.Add abort guard to agent setup steps — If
gh issue developor branch creation fails, the agent should stop — not file a new issue about the failure. Small addition toprompts/agent_system.md.Verify all race condition fixes hold — Three races were fixed today. Watch the first few dispatch cycles tomorrow for any recurrence of duplicate dispatching or spurious
needs_reviewloops.