Morning Review -- 2026-03-24
Summary
Last 24h produced 40 commits and a service restart onto v0.26.0. The big event-bus, streaming, and router-pool changes from yesterday are live, and the service came back up cleanly after restart. But the pipeline is no longer empty: 7 open GitHub issues were filed this morning for new regressions in router fallback, parser robustness, review recovery, and runner diagnostics.
The main operational pattern is not broad instability - it is state reconciliation at the edges. Several tasks have working fixes or sensible fallback behavior, but still end up blocked because of push/auth failures, malformed-output parsing, or status drift after merge.
Recent Commits (last 24h)
| Commit | Issue | Description |
|---|---|---|
| b99eeae | #868 | fix: read router pool from YAML lists |
| cd1030f | #866 | feat: parse and format NDJSON stream output for human-readable orch stream |
| 9fc9f95 | #864 | feat: round-robin router LLM across multiple cheap/free models with safe fallback |
| a872c7e | #862 | feat: model pools per complexity tier with cooldowns |
| c1b6030 | -- | fix: clarifying-question responses must report blocked, not done |
| eaafb18 | #859 | fix: review tick + subscriber double-trigger no longer double-count failures |
| 965187f | #860 | fix: review-agent tmux session cleaned up after rate limit |
| 7024cf1 | #852 | fix: recovered PR-create 422 no longer counts as a failure |
Notable Themes
Router pool rollout is live - the cheap/free multi-model router landed and the service now boots with a mixed pool (opencode, kimi, claude). This is the dominant change in today's operational profile.
Streaming got its second wave of hardening - NDJSON is now formatted for humans and stdout is streamed through tmux in real time, which makes live-session debugging much more practical.
Review-loop correctness improved - yesterday's fixes around duplicate review triggers, rate-limit cleanup, and recovered 422s are all in the recent commit set and directly reduce false retries.
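
As an illustration of the stream-formatting theme, here is a minimal Python sketch of tolerant NDJSON handling: valid lines become events, and malformed lines are passed through as raw text instead of aborting the stream. This is not the code from #866. The function names and the `type` and `text` fields are assumptions made for the example.

```python
import json
from typing import Any, Iterable, Iterator, Tuple


def parse_ndjson(lines: Iterable[str]) -> Iterator[Tuple[str, Any]]:
    """Yield ("event", obj) for valid JSON lines and ("raw", text) otherwise."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines between records
        try:
            yield ("event", json.loads(text))
        except json.JSONDecodeError:
            # Malformed line: keep it visible rather than failing the stream.
            yield ("raw", text)


def format_event(kind: str, payload: Any) -> str:
    """Render one parsed item as a single human-readable line."""
    if kind == "event" and isinstance(payload, dict):
        # "type" and "text" are hypothetical field names for this sketch.
        label = payload.get("type", "unknown")
        return f"[{label}] {payload.get('text', '')}".rstrip()
    # Non-object JSON values and unparseable lines are shown verbatim.
    return f"[raw] {payload}"
```

Keeping the malformed lines in the output, tagged as raw, means a live-session viewer still shows what a misbehaving model actually emitted.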
Retro Priorities - Status
| Priority from 03-23 Retro | Status |
|---|---|
| Monitor event bus stability | Partial pass - no panics or subscriber crashes in the morning logs after the v0.26.0 restart |
| Streaming NDJSON smoke test | Partial pass - stream formatting shipped and runner logs show NDJSON handling, but no manual orch stream validation observed in this review window |
| Channel routing smoke test | Still pending - no deliberate cross-project smoke test yet |
| Webhook re-enable | Still pending - service restarted in polling fallback mode |
| Shared auth-classifier test coverage | Still pending |
Service Health
- Version: v0.26.0 after a clean 09:55 restart
- Open GitHub issues: 7 (#873-#879), all operational bugs filed this morning
- Open task queue in this repo: 6 blocked, 1 needs_review, 1 stale internal blocked task
- Owner-feedback blockers: None explicit; current blockers are automation failures or state drift
Log Patterns
Task #875 hit a push/auth failure despite a completed local fix - logs show SSH auth failing during push ("agent refused operation"), then the review path repeatedly tried to create a PR and hit "No commits between main and <branch>". This is an infrastructure/state problem, not a new product bug, so no new issue was filed.
opencode still has an opaque exit_code=-1 failure mode - internal:8889 exited with no stdout/stderr and succeeded only after failover to claude. That lines up with open issue #874 about missing diagnostics when runner startup fails.
Router fallback is exercising the new pool in production - this morning the router recorded cooldowns for bad pool entries, including non-JSON/NDJSON responses and a timeout. That behavior is better than total failure, but it also surfaced the new open bugs #878 and #879.
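
The cooldown behavior described above can be sketched roughly as follows. This is a simplified model of round-robin selection with per-entry cooldowns, not the implementation from #862 or #864. The `RouterPool` class, its method names, and the 300-second default are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional


class RouterPool:
    """Round-robin over model names, skipping entries that are cooling down."""

    def __init__(
        self,
        models: List[str],
        cooldown_s: float = 300.0,  # hypothetical default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if not models:
            raise ValueError("pool must contain at least one model")
        self._models = list(models)
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._cooling_until: Dict[str, float] = {}
        self._next = 0  # index where the next pick starts looking

    def mark_failed(self, model: str) -> None:
        """Put a model on cooldown, e.g. after a timeout or unparseable reply."""
        self._cooling_until[model] = self._clock() + self._cooldown_s

    def pick(self) -> Optional[str]:
        """Return the next available model, or None if all are cooling down."""
        now = self._clock()
        for offset in range(len(self._models)):
            idx = (self._next + offset) % len(self._models)
            model = self._models[idx]
            if self._cooling_until.get(model, float("-inf")) <= now:
                self._next = (idx + 1) % len(self._models)
                return model
        return None
```

Returning `None` when every entry is cooling down leaves the caller to decide on a safe fallback, which this sketch deliberately does not model.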
Stuck Tasks
| Task | Status | Notes |
|---|---|---|
| #875 | blocked | Fix appears committed locally, but push/auth failure prevented PR creation and left review recovery looping |
| #877 | needs_review | Issue comments say the fix is complete and tests passed, but task state has not reconciled yet |
| #873, #874, #876, #878, #879 | blocked | Freshly filed operational bugs; all failed twice and are waiting for follow-up runs |
| internal:8068 | blocked | Evening retrospective task is still blocked even though PR #854 merged successfully |
internal:8068 is the most suspicious stale-state case this morning. PR #854 merged at 23:26Z, but the internal task still shows blocked roughly 10 hours later. That suggests post-merge status reconciliation still misses at least one internal-task path.
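
One way to surface this kind of drift is a periodic check for tasks that are still blocked after their linked PR has merged. The Python sketch below assumes a simplified task record. `TaskSnapshot`, its fields, and the 30-minute grace period are hypothetical and do not reflect the real task schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class TaskSnapshot:
    task_id: str
    status: str
    pr_merged_at: Optional[datetime]  # None if no linked PR has merged


def find_stale_blocked(
    tasks: Iterable[TaskSnapshot],
    now: datetime,
    grace: timedelta = timedelta(minutes=30),  # hypothetical threshold
) -> List[str]:
    """Return IDs of tasks still blocked after their PR merged more than `grace` ago."""
    stale = []
    for task in tasks:
        if task.status != "blocked" or task.pr_merged_at is None:
            continue  # only blocked tasks with a merged PR can be stale
        if now - task.pr_merged_at > grace:
            stale.append(task.task_id)
    return stale
```

Under these assumptions, a task like internal:8068 (blocked, with a PR merged hours earlier) would be flagged, while a task blocked with no merged PR would not.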
Operational Checks
- Are tasks stuck or failing repeatedly? Yes - the current pattern is repeated edge-case failure after otherwise successful work: parser/output handling, router-pool fallback, review recovery, and push/auth handoff.
- Are there error patterns in logs? Yes - SSH push/auth refusal, repeated PR-create 422 follow-ups on #875, opaque exit_code=-1 agent failures, and router pool entries that return NDJSON or time out.
- Did the evening retro flag anything? Yes - event bus monitoring, streaming smoke test, channel-routing smoke test, webhook re-enable, and shared auth-classifier tests. Only the first is showing early signs of stability; the rest remain open.
- Are tasks waiting on owner feedback? Not directly. The queue is blocked by automation and state-management issues, not missing product decisions.
Today's Priorities
- Clear the blocked issue queue (#873-#879) - especially #875 and #877, where the code may already be fixed but task state is wrong.
- Investigate stale status reconciliation for internal:8068 - merged PRs should not leave internal cron tasks blocked overnight.
- Harden router/output parsing around NDJSON and malformed JSON blocks - the new pool is already surfacing real response-shape variability.
- Improve runner diagnostics for startup failures - exit_code=-1 with empty output is still too opaque for reliable auto-recovery.
- Run the long-pending channel-routing smoke test - the codebase is stable enough that this should move from passive monitoring to an explicit check.
No new GitHub issues were created during this review because the operational problems found this morning are already represented by the current open issue queue.