# WIP.subagent-reliability.md ## Status Status: `open` Owner: `zap` Opened: `2026-03-13` ## Purpose Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise. ## Why now This is the highest-leverage remaining open reliability item because it affects trust in delegation and the usability of fresh implementation runs. ## Related tasks - `task-20260304-2215-subagent-reliability` — in progress - `task-20260304-211216-acp-claude-codex` — open ## Known context - Prior work already patched TUI formatting to suppress internal runtime completion context blocks. - Upstream patch exists in `external/openclaw-upstream` on branch `fix/tui-hide-internal-runtime-context` commit `0f66a4547`. - User explicitly wants subagent tooling reliability fixed and completion-event spam prevented. - Fresh-session implementation discipline and monitoring thresholds were already documented locally. ## Goals for this pass 1. Establish the current failure modes with concrete evidence. 2. Separate ACP-specific failures from generic subagent/session issues. 3. Determine what is already fixed versus still broken. 4. Produce a concrete recommendation and, if feasible in one pass, implement the highest-confidence fix. 5. Update task/memory state with evidence before ending. ## Suggested investigation plan 1. Review current OpenClaw docs and local memory around subagent/ACP failures. 2. Reproduce or inspect recent failures using session/task evidence instead of guessing. 3. Check current runtime status / relevant logs / known local patches. 4. If the issue is in OpenClaw core, work in `external/openclaw-upstream/` on a focused branch. 5. Validate with the smallest reliable reproduction possible. ## Evidence gathered so far - Fresh subagent run failed immediately with provider auth error for `zai` before any task execution. - Current installed agent auth profiles include `openai-codex:default`, `litellm:default`, and `github-copilot:github`; there is no `zai` profile configured. - Root cause for this immediate repro appears to be an incorrect explicit spawn model choice (`glm-5` alias → `zai/glm-5`) rather than missing auth propagation between agents. - Next step after confirming the model-selection issue: prefer `gpt-5.4` for fresh subagent reliability/debug passes for now, per Will's instruction, and continue separating real runtime issues from operator/config mistakes. ## Constraints - Prefer evidence over theory. - Do not claim a fix without concrete validation. - Keep the main session clean; use this file as the canonical baton. ## Success criteria - Clear diagnosis of the current reliability problem(s). - At least one of: - implemented fix with validation, or - sharply scoped next fix plan with exact evidence and files. - `memory/2026-03-13.md` (or current daily note), `memory/tasks.json`, and this WIP updated.