Gabriel Koerich Orchestrator

Evening Retrospective — 2026-03-14

Summary

Extremely productive day — 23 commits merged across 10 PRs. The reliability initiative (unwrap/expect hardening across core modules) reached completion. Several important race condition fixes landed. The system is now meaningfully more robust than 24 hours ago. However, a significant efficiency problem emerged: the same two issues were created 8–9 times each by agents running concurrently, wasting API quota and agent cycles. This is the #1 item to address tomorrow.


Recent Changes (last 12 hours)

| Commit | Description |
| --- | --- |
| 855a44e | fix: use SQLite status as source of truth in task list |
| fc9a3a3 | fix: show all tasks in orch task list and skip link for internal tasks |
| c817bbc | fix: ingest unlabeled GitHub issues into SQLite store |
| 148804b | fix: treat zero CI check-runs as success in get_combined_status |
| 2ae3202 | fix: prevent auto-merge race condition with per-task dispatching lock |
| 62a7bbf | Code development: orch (#622) |
| e29e830 | fix: stop pulling main every 45s in cleanup tick, pull only when needed |
| 6f23b1c | docs: fix missing runner.sh and result.json in artifacts listing, remove duplicate PLAN entry |
| c55a8c8 | docs: update all markdown to reflect current architecture (#621) |
| d7e55d6 | fix: tell agents not to pollute summary with push failure messages |
| 28e1249 | docs: update AGENTS.md to recommend cargo nextest run (matches CI) |
| 5c16bc5 | fix: skip build-macos on chore/docs commits to save runner minutes |
| 6918397 | fix: mark tmux integration test as #[ignore] to fix CI |
| 9de4a8d | fix: break dispatch loop when work already merged + fix CI tmux test (#620) |
| a14f1ee | fix: recover from poisoned mutex in tick, slack, telegram; remove fragile unwrap in codex error detection (#615) |
| 0c4f8a5 | ci: retrigger review-gate check (#617) |
| 83aea43 | Code review: orch (#581) |
| 0cae1cb | fix: recover from poisoned mutex in telegram, slack, and runner modules (#614) |
| 5c6a879 | Reliability: Audit & remove panicking unwrap/expect across core modules (http, tmux, sidecar, token, template, home) (#608) |
| ee3bb02 | fix: recover from poisoned mutex in webhook status guards (#605) |
| cb4f70a | fix: recover from poisoned mutex in ensure_watcher (#602) |
| ea0730d | fix: replace risky unwrap/expect in cli, slack, and http core modules (#595) |
| d1d03a0 | Daily morning review: improvements and optimization (#584) |
| e96dba1 | fix: replace unwrap() in get_all_pages with safe pagination termination (#587) |

What Completed Today

Reliability initiative (PRs #602, #605, #608, #614, #615) — The unwrap/expect audit across core modules is complete. Five PRs merged, covering http, tmux, sidecar, token, template, home, cli, slack, webhook, tick. The engine can no longer panic on a poisoned mutex or missing config in any of these paths. This was the top priority from the morning review.
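
The poisoned-mutex recovery described above can be illustrated with a minimal sketch. It uses only the standard library; the helper name lock_recovering and the demo function are hypothetical and are not taken from the orch codebase, whose actual fixes may differ.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

/// Hypothetical helper: lock a mutex, and if a previous holder panicked
/// (poisoning the mutex), take the guard anyway instead of panicking.
/// The protected data may be mid-update, so callers must tolerate that.
fn lock_recovering<T>(m: &Mutex<T>) -> MutexGuard<'_, T> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Poisons a counter mutex on purpose, then reads it through the helper.
fn demo_poison_recovery() -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let worker = Arc::clone(&counter);

    // The thread panics while holding the guard, which poisons the mutex.
    let _ = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        *guard += 1;
        panic!("simulated failure while holding the lock");
    })
    .join();

    // A plain `.lock().unwrap()` would panic here; the helper does not.
    let mut guard = lock_recovering(&counter);
    *guard += 1;
    let value = *guard;
    value
}
```

Recovering the guard trades a crash for possibly stale state, so it suits status maps and counters better than data with strict invariants.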

Race condition sweep — Three distinct races fixed:

  1. Auto-merge race: per-task dispatching lock prevents two ticks from simultaneously dispatching the same task (commit 2ae3202).
  2. Dispatch loop: when an agent's branch has no diff from main (already merged), the task now correctly falls through to done instead of looping through needs_review → new (#620).
  3. Cleanup tick: stopped the unconditional git pull on every 45s cleanup tick; it now pulls only when needed (commit e29e830).
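
A per-task dispatching lock like the one in item 1 could look roughly like the sketch below. The names DispatchLocks, DispatchClaim, and try_claim are hypothetical, and the real implementation in commit 2ae3202 may be structured differently (for example, backed by SQLite rather than process memory).

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical registry of task IDs that are currently being dispatched.
#[derive(Default)]
struct DispatchLocks {
    in_flight: Arc<Mutex<HashSet<u64>>>,
}

/// RAII claim on one task: dropping it releases the task ID.
struct DispatchClaim {
    task_id: u64,
    in_flight: Arc<Mutex<HashSet<u64>>>,
}

impl DispatchLocks {
    /// Returns Some(claim) if no other tick holds this task, None otherwise.
    fn try_claim(&self, task_id: u64) -> Option<DispatchClaim> {
        let mut set = self.in_flight.lock().unwrap_or_else(|e| e.into_inner());
        // HashSet::insert returns false when the ID was already present.
        if set.insert(task_id) {
            Some(DispatchClaim {
                task_id,
                in_flight: Arc::clone(&self.in_flight),
            })
        } else {
            None
        }
    }
}

impl Drop for DispatchClaim {
    fn drop(&mut self) {
        let mut set = self.in_flight.lock().unwrap_or_else(|e| e.into_inner());
        set.remove(&self.task_id);
    }
}
```

A tick that receives None skips the task for that cycle instead of dispatching it a second time.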

SQLite as source of truth: the task list now reads status from SQLite for internal tasks, and unlabeled GitHub issues are correctly ingested (commits 855a44e, c817bbc). The CLI orch task list output is now accurate.

Docs — Comprehensive architecture update (#621) and AGENTS.md/PLAN.md corrections.

Morning review items (2026-03-14): the morning review tracked four carry-over items from the 2026-03-13 retro and dispatched three tasks:

| Item | Plan | Outcome |
| --- | --- | --- |
| Issue #582: pagination unwrap panic | In progress as of morning | ✓ Fixed: e96dba1 replaces unwrap in get_all_pages |
| Issue #583: reliability audit (unwrap/expect) | In progress as of morning | ✓ Completed: 5 PRs merged (#602, #605, #608, #614, #615) |
| "No open PR" race condition | Monitor for recurrence | ✓ No recurrence observed; dispatch loop fix (#620) reinforces this |
| Router timeout 120s → 60s | Low priority, one-line change | ✗ Still unaddressed |

The morning review explicitly noted that no new issues were needed and that health was good. All substantive planned items were resolved. The router timeout is the only carry-over that did not land.


Failures and Retries

Issue duplication explosion — The day's most significant failure mode. Two issues were each created 8–9 times:

  • "BUG: GitHub HTTP pagination may panic when Link header missing" — issues #589, #593, #596, #599, #603, #606, #609, #612 (8 duplicates, 10:39–12:23 UTC)
  • "Reliability: Audit & remove panicking unwrap/expect across core modules" — issues #590, #594, #597, #600, #604, #607, #610, #613, #619 (9 duplicates)

Root cause: when a code-development agent creates an issue and its PR merges quickly, the issue closes. The NEXT code-development agent dispatch checks has_open_issue_with_title → gets false → creates a new issue for the same bug. Agents running concurrently also all see "no open issue" and simultaneously file duplicates. The dedup logic in src/engine/jobs.rs only checks OPEN issues, not recently-closed ones.
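
The gap in the dedup check can be made concrete with a small sketch. It assumes a simplified, hypothetical IssueSummary type and a function name, has_open_or_recent_issue_with_title, that does not exist in src/engine/jobs.rs; the real code queries GitHub rather than a slice.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical, simplified view of a GitHub issue.
struct IssueSummary {
    title: String,
    /// None while the issue is open; Some(t) once it was closed at time t.
    closed_at: Option<SystemTime>,
}

/// True if an issue with this exact title is open, or was closed no more
/// than `window` before `now`. An open-only check is the special case
/// `window == Duration::ZERO` with no issue closed exactly at `now`.
fn has_open_or_recent_issue_with_title(
    issues: &[IssueSummary],
    title: &str,
    now: SystemTime,
    window: Duration,
) -> bool {
    issues
        .iter()
        .filter(|issue| issue.title == title)
        .any(|issue| match issue.closed_at {
            None => true,
            Some(closed) => match now.duration_since(closed) {
                Ok(age) => age <= window,
                // Closed "in the future" relative to `now` (clock skew):
                // treat it as recent rather than risk a duplicate.
                Err(_) => true,
            },
        })
}
```

A time window covers the "PR merged quickly, issue already closed" path. It does not by itself cover agents that check at the same instant; that case still needs the check and the create to be serialized.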

PR #624 "Failed to link issue to branch PR" — An error message propagated into an issue title. An agent encountered a gh issue develop failure ("failed to link issue to branch"), and this error string became the issue title and PR title. Merged at 20:19 UTC. This suggests agent error handling in the code-development task is insufficient — errors during setup should abort rather than propagate into issue/PR creation.

Multiple closed PRs for same branch: PRs #601 and #616 (closed without merge) and #598 all targeted the same reliability audit work. They correspond to agent attempts that were superseded by later attempts. This pattern is expected to stop now that the dispatch loop fix (#620) has landed.


Agent Prompt Assessment

Code-development agent prompt needs strengthening. The prompt instructs agents to check git log --since 48h and open issues before creating new ones, but agents are not checking recently-closed issues. Adding "also check gh issue list --state closed --search '<title> in:title closed:>=<YYYY-MM-DD>'" (with the date set to 24h ago) to the code-development prompt would prevent re-creating issues for bugs fixed today.

Error propagation in setup steps is insufficiently guarded. The PR #624 incident shows an agent treating a gh issue develop failure as a bug to file rather than an abort condition. The agent system prompt should explicitly state: "if GitHub setup commands fail (issue develop, branch creation), STOP immediately — do not file issues for infrastructure failures."
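
If the same rule were also enforced in code rather than only in the prompt, it might look like the sketch below. The AgentError and NextAction types are hypothetical illustrations and are not part of the orch codebase.

```rust
/// Hypothetical classification of failures an agent can hit.
#[derive(Debug, PartialEq)]
enum AgentError {
    /// GitHub or git setup failed (for example issue linking, branch creation).
    Infrastructure(String),
    /// A defect found in the code under development.
    CodeDefect(String),
}

/// Hypothetical decision the agent takes after a failure.
#[derive(Debug, PartialEq)]
enum NextAction {
    /// Stop the task; nothing is filed.
    Abort { reason: String },
    /// File a bug with the given title.
    FileIssue { title: String },
}

/// Infrastructure failures abort; only real code defects become issues,
/// so an error string can never end up as an issue or PR title.
fn next_action(err: &AgentError) -> NextAction {
    match err {
        AgentError::Infrastructure(msg) => NextAction::Abort {
            reason: msg.clone(),
        },
        AgentError::CodeDefect(msg) => NextAction::FileIssue {
            title: format!("BUG: {msg}"),
        },
    }
}
```

Under this split, the "failed to link issue to branch" message from the PR #624 incident would map to Abort.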

Routing prompts are working well. All tasks were correctly classified. No routing misfires observed.


Routing Accuracy

| Task | Routed To | Outcome |
| --- | --- | --- |
| internal:57 (opencode fix, existing worktree) | opencode | ✓ Correct: reused branch |
| internal:58 (code development) | claude/sonnet | ✓ Correct: Rust fixes |
| internal:993 (code development, #622) | claude/opus | ✓ Correct: complex reliability work |
| internal:995 (code review, #623) | kimi | ✓ Correct: focused review |
| internal:1004 (this retro) | claude/sonnet | ✓ Correct: analysis task |

Performance

  • Throughput: 23 commits, 10 PRs merged in 12 hours. Highest daily throughput in recent history.
  • CI: Green across all runs. nextest migration (#580) working well.
  • API quota: Wasted on 17 duplicate issues (8 + 9) and their associated label/comment operations. Estimate 60–80 excess API calls.
  • Cleanup tick: No longer pulls main on every 45s cycle — significant reduction in git network traffic.

Open Items

Issue dedup does not check recently-closed issues — The dedup guard in create_self_improvement_issue (jobs.rs:509) only calls has_open_issue_with_title. When an issue closes within the same day, the next agent dispatch creates a new one. Fix: extend has_open_issue_with_title to also check issues closed within the last 24h, OR add a cooldown table in SQLite per issue title.
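
The cooldown alternative can be sketched with an in-memory map standing in for the proposed SQLite table. The IssueCooldown type and try_file method are hypothetical, and a real version would persist rows keyed by issue title.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// In-memory stand-in for the proposed per-title cooldown table
/// (issue title -> time the issue was last filed).
struct IssueCooldown {
    window: Duration,
    last_filed: HashMap<String, SystemTime>,
}

impl IssueCooldown {
    fn new(window: Duration) -> Self {
        IssueCooldown {
            window,
            last_filed: HashMap::new(),
        }
    }

    /// Returns true and records `now` if the title may be filed.
    /// Returns false while the title is still inside its cooldown window.
    fn try_file(&mut self, title: &str, now: SystemTime) -> bool {
        if let Some(prev) = self.last_filed.get(title) {
            // An earlier `now` than the stored time (clock skew) counts as
            // "still cooling down".
            let cooling = now
                .duration_since(*prev)
                .map(|age| age < self.window)
                .unwrap_or(true);
            if cooling {
                return false;
            }
        }
        self.last_filed.insert(title.to_string(), now);
        true
    }
}
```

Because the check and the record happen in one call, a persisted version run inside a single transaction would likely also cover concurrent agents, which the closed-issue lookup alone does not.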

Agent prompt gap: error propagation — Code-development agents should abort on infrastructure failures (branch creation, issue linking) rather than filing issues about the failure. One-line addition to agent_system.md.

Router timeout 120s → 60s — Still not addressed. Still a one-line change in src/engine/router/config.rs:24. Low priority.


Issues Filed

1 issue filed: duplicate issue creation when open-issue dedup misses recently-closed issues.

This is the root cause of the 8–9x duplication today. Filed as a targeted bug with the specific files and fix scoped.


Tomorrow's Priority

  1. Fix issue dedup to include recently-closed issues (24h window) — Root cause of today's most wasteful failure pattern. The fix is in src/engine/jobs.rs (extend has_open_issue_with_title) and optionally the code-development agent prompt (add closed-issue check). One issue filed today.

  2. Add abort guard to agent setup steps — If gh issue develop or branch creation fails, the agent should stop — not file a new issue about the failure. Small addition to prompts/agent_system.md.

  3. Verify all race condition fixes hold — Three races were fixed today. Watch the first few dispatch cycles tomorrow for any recurrence of duplicate dispatching or spurious needs_review loops.
