swarm-zap/memory/2026-03-13.md

2026-03-13

Subagent reliability investigation

  • A fresh implementation subagent launched for the subagent/ACP reliability work failed immediately, before doing any task work.
  • Failure mode: delegated run was spawned with model glm-5, which resolved to provider model zai/glm-5.
  • Auth profile keys currently installed in the agent stores include openai-codex:default, litellm:default, and github-copilot:github.
  • Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for glm-5.
  • Verified by inspecting agent auth profile keys under:
    • /home/openclaw/.openclaw/agents/*/agent/auth-profiles.json
  • Relevant OpenClaw docs confirm:
    • subagent spawns inherit caller model when sessions_spawn.model is omitted
    • provider/model auth errors like No API key found for provider "zai" occur when a provider model is selected without matching auth
    • multi-agent auth is per-agent via ~/.openclaw/agents/<agentId>/agent/auth-profiles.json
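The documented rules above can be sketched as a small model-resolution plus auth check. This is a hypothetical reconstruction, not the actual OpenClaw implementation: `resolveSpawnModel` and `checkProviderAuth` are made-up names, and the key shapes are assumed from the observed `auth-profiles.json` contents.

```typescript
// Hypothetical sketch of the documented behavior: spawns inherit the caller
// model when sessions_spawn.model is omitted, and a provider model without a
// matching "<provider>:<profile>" auth key fails with a no-API-key error.
function resolveSpawnModel(callerModel: string, requested?: string): string {
  return requested ?? callerModel; // omitted model => inherit caller model
}

function checkProviderAuth(providerModel: string, authKeys: string[]): string | null {
  const provider = providerModel.split("/")[0];
  const ok = authKeys.some((k) => k.startsWith(provider + ":"));
  return ok ? null : `No API key found for provider "${provider}"`;
}

// Keys actually observed in this environment:
const keys = ["openai-codex:default", "litellm:default", "github-copilot:github"];

// Explicit glm-5 resolved to zai/glm-5, which has no matching auth key:
console.log(checkProviderAuth(resolveSpawnModel("litellm/glm-5", "zai/glm-5"), keys));
// prints: No API key found for provider "zai"
```

Under this reading, omitting the model entirely would have inherited a provider the caller already has auth for, which is consistent with the conclusion below.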
  • Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
  • Corrective action: retry fresh delegation with litellm/glm-5 (the intended medium-tier routed model for delegated implementation work in this setup).
  • Will explicitly requested on 2026-03-13 to use gpt-5.4 for subagents for now while debugging delegation reliability.
  • New evidence from the corrected run:
    • ~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl shows repeated assistant stopReason:"error" entries with 429 ... GLM-5 not included in current subscription plan
    • yet ~/.openclaw/subagents/runs.json recorded run 776a8b51-6fdc-448e-83bc-55418814a05b as outcome.status: "ok" with frozenResultText: null
  • This separates the ACP/runtime model-choice problem from a generic subagent completion-reporting bug: a terminal assistant error can still be persisted and announced as a success with no useful result.
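The mismatch check can be sketched as follows. The field names mirror the log evidence (stopReason, outcome.status, frozenResultText), but the record shapes and helper names are assumptions, not the persisted OpenClaw schemas:

```typescript
// Hypothetical sketch: flag a run whose recorded outcome says "ok" while the
// child transcript's terminal assistant entry actually stopped with an error.
interface TranscriptEntry {
  role: string;
  stopReason?: string;
  errorMessage?: string;
}
interface RunOutcome {
  status: string; // "ok" | "error" | ...
  frozenResultText: string | null;
}

function terminalAssistant(entries: TranscriptEntry[]): TranscriptEntry | undefined {
  for (let i = entries.length - 1; i >= 0; i--) {
    if (entries[i].role === "assistant") return entries[i];
  }
  return undefined;
}

function isMisreportedSuccess(entries: TranscriptEntry[], outcome: RunOutcome): boolean {
  return outcome.status === "ok" && terminalAssistant(entries)?.stopReason === "error";
}

// Data shaped like the buggy run evidence above:
const transcript: TranscriptEntry[] = [
  { role: "user" },
  { role: "assistant", stopReason: "error", errorMessage: "429 ... GLM-5 not included in current subscription plan" },
];
const recorded: RunOutcome = { status: "ok", frozenResultText: null };
console.log(isMisreportedSuccess(transcript, recorded)); // true
```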
  • Implemented upstream fix on branch external/openclaw-upstream@fix/subagent-wait-error-outcome:
    • added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
    • subagent registry now downgrades agent.wait => ok to error when the child session's terminal assistant message is actually an error
    • subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful (no output) completions
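The direction of that fix can be sketched in simplified form. Everything here is a hypothetical reduction of the upstream change: `terminalText` and `downgradeIfTerminalError` are invented names, and the message/outcome shapes are assumptions.

```typescript
// Hypothetical sketch of the fix: derive usable terminal text even from an
// empty-content assistant error, and downgrade an "ok" wait outcome when the
// child's terminal assistant message actually stopped with an error.
interface TerminalMessage {
  stopReason?: string;
  errorMessage?: string;
  text?: string; // may be empty for error stops
}
interface WaitOutcome {
  status: "ok" | "error";
  resultText: string | null;
}

function terminalText(msg: TerminalMessage): string | null {
  if (msg.text && msg.text.trim().length > 0) return msg.text;
  if (msg.stopReason === "error") {
    return msg.errorMessage ?? "subagent terminated with an unspecified error";
  }
  return null;
}

function downgradeIfTerminalError(outcome: WaitOutcome, terminal: TerminalMessage): WaitOutcome {
  if (outcome.status === "ok" && terminal.stopReason === "error") {
    return { status: "error", resultText: terminalText(terminal) };
  }
  return outcome;
}

const fixed = downgradeIfTerminalError(
  { status: "ok", resultText: null },
  { stopReason: "error", errorMessage: "429 ... GLM-5 not included in current subscription plan" },
);
console.log(fixed.status); // error
```

The design point is that the downgrade happens at the registry boundary, so both persistence and the announce flow see the corrected outcome.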
  • Targeted validation passed:
    • pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: 50 tests passed across 3 files
  • Real success-path verification later passed on gpt-5.4 with run 23750d80-b481-4f50-b219-cc9245be405f and final child result SUCCESS-PROBE-OK.
  • Real failure-path verification later also passed on gpt-5.4, by intentionally triggering a context_length_exceeded provider error with a token-dense, oversized task payload.
    • child run: b50cb91f-6219-44f7-9d2f-a1264ac7ceaf
    • child session: agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594
    • child transcript: ~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl
    • transcript terminal assistant entry recorded provider:"openai-codex", model:"gpt-5.4", stopReason:"error", errorMessage:"Codex error: {...context_length_exceeded...}"
    • matching ~/.openclaw/subagents/runs.json now correctly stored:
      • outcome.status: "error"
      • outcome.error: "Codex error: {...context_length_exceeded...}"
      • endedReason: "subagent-error"
      • frozenResultText: "Codex error: {...context_length_exceeded...}"
  • Important remaining nuance from the live repro: a raw gateway agent.wait for that same failed child returned status:"ok" with only endedAt, even though the child transcript's terminal assistant message had stopReason:"error".
  • Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional agent.wait layering choice:
    • embedded subscribe lifecycle already emits phase:"error" for terminal assistant/provider failures
    • but src/commands/agent.ts had a fallback lifecycle emitter that still sent phase:"end" whenever no inner lifecycle callback was observed, even if the resolved run result carried meta.stopReason:"error"
    • waitForAgentJob gives lifecycle errors a retry grace window, so that fallback end could overwrite the terminal failure and make agent.wait resolve ok
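The fallback-emitter bug above reduces to a one-line decision. This sketch is a deliberate simplification with invented names (`fallbackPhaseBuggy`, `fallbackPhaseFixed`); the real code in src/commands/agent.ts involves more plumbing:

```typescript
// Hypothetical reduction of the fallback lifecycle emitter: when no inner
// lifecycle callback fired, the buggy path always emitted "end", which could
// overwrite a terminal failure during waitForAgentJob's retry grace window.
type Phase = "start" | "end" | "error";
interface ResolvedRun {
  meta?: { stopReason?: string; errorMessage?: string };
}

// Buggy fallback: unconditionally signal a clean end.
function fallbackPhaseBuggy(_run: ResolvedRun): Phase {
  return "end";
}

// Fixed fallback: surface the terminal error so agent.wait resolves as error.
function fallbackPhaseFixed(run: ResolvedRun): Phase {
  return run.meta?.stopReason === "error" ? "error" : "end";
}

const failedRun: ResolvedRun = { meta: { stopReason: "error", errorMessage: "provider failure" } };
console.log(fallbackPhaseBuggy(failedRun)); // end   -> agent.wait resolves ok
console.log(fallbackPhaseFixed(failedRun)); // error -> agent.wait resolves error
```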
  • Implemented focused upstream follow-up on branch fix/subagent-wait-error-outcome:
    • src/commands/agent.ts now emits lifecycle phase:"error" with extracted terminal error text when a resolved run stops with meta.stopReason:"error" and no inner lifecycle callback fired
    • src/gateway/server-methods/agent-wait-dedupe.ts now also maps completed agent dedupe payloads with result.meta.stopReason:"error" to status:"error" and aborted:true to status:"timeout"
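The dedupe mapping can be sketched as a small status function. The payload shape here is an assumption reconstructed from these notes, not the actual agent-wait-dedupe.ts types:

```typescript
// Hypothetical sketch of the dedupe-status mapping: completed payloads whose
// result carries meta.stopReason:"error" map to "error", aborted to "timeout".
interface DedupePayload {
  aborted?: boolean;
  result?: { meta?: { stopReason?: string } };
}

function mapDedupeStatus(p: DedupePayload): "ok" | "error" | "timeout" {
  if (p.aborted) return "timeout"; // aborted waits surface as timeouts
  if (p.result?.meta?.stopReason === "error") return "error"; // terminal failure
  return "ok";
}

console.log(mapDedupeStatus({ result: { meta: { stopReason: "error" } } })); // error
console.log(mapDedupeStatus({ aborted: true })); // timeout
console.log(mapDedupeStatus({ result: {} })); // ok
```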
  • Targeted validation passed:
    • pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: 81 tests passed across 3 files
  • Live runtime verification of the new agent.wait fix was not re-run in this pass; current evidence is exact code-path inspection plus focused tests.
  • Side note: unrelated dirty /subagents log UX changes in external/openclaw-upstream passed regression (src/auto-reply/reply/commands.test.ts, 44 tests) but were intentionally left out of scope for this focused reliability pass.
  • Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck, rather than assuming they are fine until completion.
  • Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.