Morning Review — 2026-04-21
Recent Commits (last 24h)
15 commits focused on reliability and correctness:
- ff93af99 fix(tasks): remove pre-emptive set_block_reason(None) before conditional update (#2892)
- c1324329 fix(parser): remove dead best_status_known variable (#2891)
- 0be53d3c bug: parser failures from opencode models (parse_error) and empty/invalid responses (#2887)
- 87dc514f bug(sync): auto-merge CI failures can leave tasks blocked without reconciliation (#2886)
- 5a047047 fix(parser): tighten NDJSON candidate selection to reject bogus status values (#2885)
- e0fde494 bug(sync): merged PRs can stay blocked indefinitely after CI-failure escalation (#2884)
- c0005835 fix(discord_ws): wrap all websocket send operations with 10s timeout (#2878)
- 5a464178 bug(sync): NeedsReview refire escalation triggers after 4 refires instead of 5 (off-by-one) (#2872)
- 29dcf9a2 perf(router): Regex::new called in loop for static ASCII patterns (#2877)
- cb6162b2 fix(store): propagate decode errors for created_at/updated_at (#2875)
- 29b22ff0 fix(patterns): map lowercased byte offsets back to original string (#2870)
- 945ce6a3 fix(router): use floor_char_boundary for safe UTF-8 slicing (#2869)
- c7b88e0a Review .await usage while holding Mutexes (#2868)
- 5aedd08a fix(router): replace lowercased-index slicing with case-insensitive regex (#2866)
Operational Health
Service and logs
- Watchdog stalls observed: Two tick stalls detected (89s at 13:44 UTC, 79s at 13:45 UTC). Both occurred in ticks where the LLM routing budget (45s) was exhausted, followed by fallback to round-robin and task dispatch. The engine recovered automatically, but the pattern indicates that LLM routing is timing out frequently.
- LLM routing budget exceeded: Multiple tasks fell back to round-robin in the last hour. The resulting tick delays exceed the 60s watchdog threshold.
- Kimi degraded: Pre-emptive health check marked Kimi as degraded (agent in cooldown).
- /opt/homebrew/var/log/orch.error.log is stale (last modified 2026-04-19); no current-run errors.
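The budget-then-fallback behavior described in the stall entries can be sketched as follows. This is a minimal illustration, not the engine's actual code: `RoundRobin`, `Route`, and `route_with_budget` are hypothetical names, and the real router is presumably async rather than thread-based. It uses only the standard library (`std::sync::mpsc::Receiver::recv_timeout`).

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical agent pool used when LLM routing does not answer in time.
pub struct RoundRobin {
    agents: Vec<String>,
    next: usize,
}

impl RoundRobin {
    pub fn new(agents: Vec<String>) -> Self {
        assert!(!agents.is_empty(), "need at least one agent");
        Self { agents, next: 0 }
    }

    /// Returns the next agent and advances the cursor, wrapping at the end.
    pub fn pick(&mut self) -> String {
        let agent = self.agents[self.next].clone();
        self.next = (self.next + 1) % self.agents.len();
        agent
    }
}

/// How a task ended up being routed (illustrative type).
#[derive(Debug, PartialEq)]
pub enum Route {
    Llm(String),
    Fallback(String),
}

/// Runs `router` on a worker thread and waits at most `budget` for its answer.
/// On timeout, or if the worker dies without answering, falls back to round-robin.
pub fn route_with_budget<F>(router: F, budget: Duration, rr: &mut RoundRobin) -> Route
where
    F: FnOnce() -> String + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails harmlessly if the caller has already given up.
        let _ = tx.send(router());
    });
    match rx.recv_timeout(budget) {
        Ok(agent) => Route::Llm(agent),
        Err(_) => Route::Fallback(rr.pick()),
    }
}
```

In this sketch the caller never waits longer than `budget` per routing call, so time spent past the budget would have to come from work done after the fallback (or from several budgeted calls in one tick).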
Task/run health (24h)
Outcomes (160 total):
- success: ~116
- failed: ~20
- rate_limit: ~5
- parse_error: ~3
- timeout: ~1
- empty outcome: ~3
Top agent/model outcomes:
- minimax/opus: 28 success (healthiest)
- claude/sonnet: 20 success, 3 failed, 1 timeout, 1 empty
- opencode/minimax-m2.5-free: 17 success, 3 failed, 1 empty
- codex/gpt-5.3-codex: 16 success, 2 failed, 2 empty
- opencode/gpt-5-mini: 15 success
- glm/opus: 8 success, 2 rate_limit
task_activity (12h): status_change (43), dispatch (18), branch_delete (14), routed (9), push (8), review_start (4), review_decision (4), pr_create (3), error (2), rerouted (1).
No broad engine stall — instability remains in routing timeouts and specific model lanes.
Stuck / Blocked Work
- #2789 (blocked): GLM artifact collection. Still waiting.
- #2881 (new): task_runs.error stores raw api_retry JSON fragments, masking the real error. Created 2026-04-20.
Retro Follow-up Status (from 2026-04-20 evening)
- Close #2831: done — decode-error/metrics sweep complete, issue should be closable.
- File/fix #2881: in progress — issue was created yesterday, needs fix.
- GLM investigation (#2789): still blocked, pending artifact collection.
- Parse_error patterns: still occurring — 3 parse_errors in 24h from minimax/opus, nemotron, and minimax/free models.
- orch stream --pipe validation: still pending.
Tasks Waiting on Owner Feedback
- No open issues currently labeled needs-feedback.
Priorities for Today
- Diagnose LLM routing budget timeouts: Multiple watchdog stalls traced to 45s LLM routing budget exhaustion. Consider reducing budget or optimizing routing path.
- Fix #2881: task_runs.error JSON fragment storage — quick sanitize needed.
- Investigate parse_error patterns: 3 in 24h, need sample outputs to tighten parser further.
- Continue GLM investigation: #2789 still blocked.
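For the routing-budget priority, a rough headroom calculation may help when choosing a smaller budget. The sketch below assumes (this is not confirmed by the logs) that routing calls within one tick run sequentially and that each can exhaust its full budget; `worst_case_tick` and `max_budget` are hypothetical helpers, and the overhead figure stands in for fallback plus dispatch time.

```rust
use std::time::Duration;

/// Worst-case tick duration, assuming `tasks` routing calls run one after
/// another and each exhausts `budget`, plus a fixed `overhead` for fallback
/// and dispatch. The sequential model is an assumption.
pub fn worst_case_tick(tasks: u32, budget: Duration, overhead: Duration) -> Duration {
    budget * tasks + overhead
}

/// Largest per-call budget that keeps the worst case within `watchdog`.
/// Returns None when there are no tasks or the overhead alone exceeds the
/// watchdog threshold.
pub fn max_budget(tasks: u32, watchdog: Duration, overhead: Duration) -> Option<Duration> {
    if tasks == 0 {
        return None;
    }
    watchdog.checked_sub(overhead).map(|spare| spare / tasks)
}
```

Under these assumptions, a single call at the current 45s budget leaves 15s of headroom below the 60s watchdog, while two sequential calls that both time out would already take 90s before any dispatch work.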
Issue Creation
No new operational issues created in this review. #2881 was already created yesterday.
The service is functional but routing performance needs attention — the 45s LLM budget is causing cascading delays that trigger watchdog stalls.
Prepared by Orch automation (internal task internal:146650).