Morning Review — 2026-04-24
Recent Commits (since 2026-04-21)
10 commits focused on correctness and resilience:
- 17812bde fix: drop tmux session detached
- 69f9b567 fix(cron): normalize_dow handles degenerate range 0-0 correctly
- 5e9b82f7 fix(git_ops): fail loudly when auto-commit cleanup cannot unstage (#2920)
- 73b37c32 fix(silence): skip stale-generation entries and increase grace period to 5min (#2960)
- 41a940eb feat(cooldown): add extended-backoff tier for persistently failing models (#2944)
- 0f78b48d bug(runner): GH_TOKEN is injected after tmux process start, so agent never sees it (#2929)
- b57c673f bug(store): append_memory/recent_memory silently drop corrupt memory entries (#2928)
- 21372c1e bug(security): scan() uses find() instead of find_iter(), missing multiple secrets per line (#2925)
- 39a70ade fix(store): wrap status-change methods in transactions (#2900)
- 769a23f5 docs: morning review 2026-04-21 (#2893)
Notable changes:
- GH_TOKEN fix (#2929): the agent runner was injecting GH_TOKEN only after the tmux process had started, so the agent subprocess never saw it. This should resolve the git push failures that were caused by missing auth.
- Extended backoff tier (#2944): adds a new cooldown tier for persistently failing models, escalating beyond the standard backoff. Important for GLM's ongoing rate-limit issues.
- Security scan fix (#2925): scan() was using find() (first match only) instead of find_iter() (all matches), so on a line containing several secrets only the first was detected and the rest could leak.
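
The first-match versus all-matches difference behind #2925 can be illustrated with a minimal sketch. This is a Python analogue, not the project's scan() implementation: it uses re.search and re.finditer in place of find() and find_iter(), and the pattern, function names, and sample tokens are hypothetical.

```python
import re

# Hypothetical pattern for illustration only; not the project's real rule set.
SECRET_RE = re.compile(r"(?:ghp|sk)_[A-Za-z0-9]{8,}")


def scan_first_only(line: str) -> list[str]:
    """Buggy shape: stops at the first match, so later secrets are missed."""
    match = SECRET_RE.search(line)
    return [match.group(0)] if match else []


def scan_all(line: str) -> list[str]:
    """Fixed shape: collects every non-overlapping match on the line."""
    return [m.group(0) for m in SECRET_RE.finditer(line)]


# Two fake tokens on one line (made-up values).
line = "GH=ghp_abcdEFGH1234 OPENAI=sk_zyxwVUTS9876"

print(scan_first_only(line))  # only the first token is reported
print(scan_all(line))         # both tokens are reported
```

With a single secret per line the two functions agree, which may be why this class of bug goes unnoticed until a multi-secret line appears.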
Operational Health
Service and logs
- Watchdog stall observed: 80s stall at 10:06 UTC — routing took 55s (LLM budget exceeded + round-robin fallback), pushing the full tick over the 60s watchdog threshold. This is the same pattern as yesterday.
- No evening retrospectives: No evening retros found for 04-22, 04-23, or 04-24 — the evening task may have been skipped or failed on those days.
- Error log is empty (/opt/homebrew/var/log/orch.error.log is 0B).
- Auto-merge SSH failures: Both internal:148542 and internal:148554 (bean project) failed to rebase (sign_and_send_pubkey: signing failed for ED25519 from agent). The GH_TOKEN fix (#2929) should help with token-based git auth but does not address SSH agent key failures, which are a separate issue.
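
The timing in the watchdog stall bullet above can be worked through as a rough sketch. The 60s threshold, 55s routing time, 80s tick, and 45s routing budget come from this report; the split of a tick into "routing" plus "everything else", run serially, is an assumption, and the function and variable names are illustrative.

```python
WATCHDOG_THRESHOLD_S = 60.0  # from the report
OBSERVED_ROUTING_S = 55.0    # from the report
OBSERVED_TICK_S = 80.0       # from the report
ROUTING_BUDGET_S = 45.0      # from the report


def overrun(tick_s: float, threshold_s: float = WATCHDOG_THRESHOLD_S) -> float:
    """Seconds by which a tick exceeds the watchdog threshold (0 if within it)."""
    return max(0.0, tick_s - threshold_s)


# Assumption: tick time = routing time + all other work, executed serially.
other_work_s = OBSERVED_TICK_S - OBSERVED_ROUTING_S

# Largest routing time that would still keep this tick within the threshold.
max_routing_s = WATCHDOG_THRESHOLD_S - other_work_s

# Tick time if routing had stopped exactly at its budget, with no fallback.
tick_at_budget_s = ROUTING_BUDGET_S + other_work_s

print(f"non-routing work: {other_work_s:.0f}s")
print(f"observed overrun: {overrun(OBSERVED_TICK_S):.0f}s")
print(f"routing headroom for this tick: {max_routing_s:.0f}s")
print(f"overrun at the 45s budget: {overrun(tick_at_budget_s):.0f}s")
```

Under that serial assumption, a tick with this much non-routing work would exceed the threshold even if routing stopped exactly at the 45s budget, which is consistent with the suggestion to tune the budget. Actual tick composition should be checked against logs before picking a value.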
Task/run health (24h)
Total runs: ~222 (from task_runs)
- Success: 185+ runs (claude/sonnet 70, codex/gpt-5.3-codex 27, minimax/opus 24, opencode/gpt-5-mini 22, claude/opus 15, etc.)
- GLM (opus): 6 success, 3 failed, 3 rate_limit, 1 timeout — GLM is still struggling with rate limits and failures. The extended-backoff tier (#2944) should help here.
- opencode/gpt-5.3: 3 failures
- opencode/gpt-5.4: 3 failures
- kimi/opus: 2 failures, 1 rate_limit, 1 aborted, 1 push_failed — degraded state continues
- parse_error: 1 from claude/sonnet, 1 from opencode/gemini-3.1-pro-preview
- aborted: 2 from claude/sonnet
task_activity (12h): status_change (1579), dispatch (417), push (388), review_start (292), review_decision (280), error (198), branch_delete (121), routed (113), pr_create (105), rerouted (19), timeout (3).
Overall: healthy throughput with GLM and Kimi degraded.
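
For readers unfamiliar with tiered cooldowns, the idea behind the extended-backoff tier (#2944) can be sketched roughly as below. This is not the code from that change: the class name, the tier boundary, and every duration are hypothetical values chosen only to show a standard exponential tier escalating to a longer fixed cooldown for persistently failing models.

```python
from dataclasses import dataclass


@dataclass
class CooldownPolicy:
    # All values are hypothetical, for illustration only.
    base_s: float = 30.0          # first cooldown in the standard tier
    standard_cap_s: float = 600.0  # ceiling for the standard tier
    extended_after: int = 5        # consecutive failures that trigger the extended tier
    extended_s: float = 3600.0     # fixed cooldown in the extended tier

    def cooldown_for(self, consecutive_failures: int) -> float:
        """Return the cooldown in seconds after N consecutive failures."""
        if consecutive_failures <= 0:
            return 0.0
        if consecutive_failures >= self.extended_after:
            # Persistently failing model: escalate past the standard tier.
            return self.extended_s
        # Standard tier: exponential backoff, capped.
        return min(self.base_s * 2 ** (consecutive_failures - 1), self.standard_cap_s)


policy = CooldownPolicy()
for failures in range(0, 7):
    print(failures, policy.cooldown_for(failures))
```

A model that keeps hitting rate limits, as GLM does in the figures above, would spend most of its time in the extended tier under a policy of this shape, reducing the number of doomed dispatches.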
Stuck / Blocked Work
- #2789 (blocked): GLM artifact collection. Still waiting.
- #2881 (blocked): task_runs.error stores raw api_retry JSON fragments. Still needs fix.
- internal:148540 (blocked): self-improvement task; review agent exceeded failure threshold. Needs human attention.
- internal:148556 (blocked): twitter bookmarks research, blocked on codex.
A large batch of old Solana/oblivion/keeper tasks from the Oblivion engine remains blocked (all reached the CI failure limit of 3 during auto-merge). These are unlikely to resolve without human intervention.
Retro Follow-up Status (from 2026-04-21 morning)
- LLM routing budget timeouts: Still causing watchdog stalls (80s today). The routing budget is still 45s — hasn't been tuned yet.
- Fix #2881 (task_runs.error JSON fragments): Still blocked, needs fix.
- GLM investigation (#2789): Still blocked, pending artifact collection.
- Parse_error patterns: Down to 2 in 24h — better but still occurring.
- orch stream --pipe validation: No evidence of this being done.
- No evening retros for the past 3 days: the evening task may need investigation.
Tasks Waiting on Owner Feedback
- No open issues currently labeled needs-feedback.
Priorities for Today
- Monitor GLM recovery: Extended-backoff tier (#2944) was just deployed. Watch if GLM rate limits decrease.
- Investigate auto-merge SSH failures for bean project: both internal:148542 and internal:148554 failed to rebase due to ED25519 key failures. The GH_TOKEN fix won't help SSH; separate root cause.
- Fix #2881: task_runs.error JSON fragment storage. Low effort, high value.
- Resolve evening retro gap: 3 days without retrospectives. Check if the evening task is firing or failing silently.
- Consider tuning LLM routing budget to reduce watchdog stalls — the 45s budget is causing 80s+ tick times.
Issue Creation
No new operational issues filed. The main patterns (LLM routing timeout, GLM rate limits, SSH failures in bean) are either already known or need more data before filing.
Prepared by Orch automation (internal task internal:148558).