# 2026-03-13 ## Subagent reliability investigation - Fresh implementation subagent launch for subagent/ACP reliability failed immediately before doing any task work. - Failure mode: delegated run was spawned with model `glm-5`, which resolved to provider model `zai/glm-5`. - Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`. - Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for `glm-5`. - Verified by inspecting agent auth profile keys under: - `/home/openclaw/.openclaw/agents/*/agent/auth-profiles.json` - Relevant OpenClaw docs confirm: - subagent spawns inherit caller model when `sessions_spawn.model` is omitted - provider/model auth errors like `No API key found for provider "zai"` occur when a provider model is selected without matching auth - multi-agent auth is per-agent via `~/.openclaw/agents//agent/auth-profiles.json` - Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents. - Corrective action: retry fresh delegation with `litellm/glm-5` (the intended medium-tier routed model for delegated implementation work in this setup). - Will explicitly requested on 2026-03-13 to use `gpt-5.4` for subagents for now while debugging delegation reliability. - New evidence from the corrected run: `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` shows repeated assistant `stopReason:"error"` entries with `429 ... GLM-5 not included in current subscription plan`, but `~/.openclaw/subagents/runs.json` recorded run `776a8b51-6fdc-448e-83bc-55418814a05b` as `outcome.status: "ok"` and `frozenResultText: null`. - That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result. - Implemented upstream fix on branch `external/openclaw-upstream@fix/subagent-wait-error-outcome`: - added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text - subagent registry now downgrades `agent.wait => ok` to `error` when the child session's terminal assistant message is actually an error - subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful `(no output)` completions - Targeted validation passed: - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts` - result: `50 tests` passed across `3` files - Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`. - Real failure-path verification later also passed on valid `gpt-5.4` by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload. - child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf` - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594` - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl` - transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"` - matching `~/.openclaw/subagents/runs.json` now correctly stored: - `outcome.status: "error"` - `outcome.error: "Codex error: {...context_length_exceeded...}"` - `endedReason: "subagent-error"` - `frozenResultText: "Codex error: {...context_length_exceeded...}"` - Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`. - Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice: - embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures - but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"` - `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok` - Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`: - `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired - `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"` - Targeted validation passed: - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts` - result: `81 tests` passed across `3` files - Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does **not** hold on the live direct gateway path. - first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as transport sanity, not canonical semantics proof - stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch: - session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572` - run id: `gwc-live-agent-wait-gpt53-1773427893583` - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}` - last assistant still recorded `stopReason:"error"` with `context_length_exceeded` - decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with: - `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured` - gateway log confirmed default model `openai-codex/gpt-5.3-codex` - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586` - run id: `gwc-live-agent-wait-gpt53-source-1773427981614` - payload chars: `880150` - start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}` - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}` - same session's terminal assistant message still recorded: - `provider:"openai-codex"` - `model:"gpt-5.3-codex"` - `stopReason:"error"` - `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"` - Fast source inspection after that live repro points to the most likely remaining gap: - `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded` - `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"` - `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"` - so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as `end` early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path matters - Net status at end of this pass: - subagent persistence/announcement fix: live-verified - raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed - Final focused live-fix pass on 2026-03-13 closed the remaining raw `agent.wait` bug. - root cause: the live direct gateway path could receive `agent_end` carrying a terminal assistant error without a preceding `message_end`, leaving stale/empty assistant state and still emitting lifecycle `phase:"end"` - final upstream fix taught embedded subscribe lifecycle handling to recover the terminal assistant from `agent_end.messages` / session transcript and emit lifecycle `phase:"error"`, and taught the gateway `agent` RPC handler to derive terminal status from observed lifecycle + final result metadata instead of blindly caching `ok` - final targeted validation passed: - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts` - result: `108 tests` passed across `5` files - decisive live source-gateway repro after the final fix: - gateway: source-run on port `18903` - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008` - final `agent` response returned `finalStatus:"error"` - matching `agent.wait` returned `status:"error"` with the same context-window error text - Net status now: - subagent persistence/announcement fix: live-verified ✅ - raw `agent.wait` semantics fix: live-verified ✅ - Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass. ## ACP Claude/Codex follow-up (post-`agent.wait` fix) - Historical deferred task `task-20260304-211216-acp-claude-codex` still referenced old failures `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were **not** reproduced in the latest focused pass. - Current host state check: - `claude` installed: `/home/linuxbrew/.linuxbrew/bin/claude` (`2.1.63`) - `codex` installed: `/home/linuxbrew/.linuxbrew/bin/codex` (`0.107.0`) - no global `acpx` on PATH, but bundled plugin-local runtime exists at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx` - current `~/.openclaw/openclaw.json` only showed `plugins.entries.telegram.enabled=true`; no explicit `acp` block / `acpx` plugin entry was present, so the smallest reliable repro path used the bundled `acpx` directly rather than a full OpenClaw ACP session - Live direct bundled-acpx repro results: - Codex command: - `.../acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'` - result: clean JSON-RPC/session stream ended with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0` - Claude command: - `.../acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'` - stdout included top-level JSON-RPC errors: - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}` - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}` - process still exited `0` - Source-level finding in `external/openclaw-upstream/extensions/acpx/src/runtime-internals/events.ts`: - prompt parsing handled typed `{type:"error"}` lines but dropped top-level JSON-RPC `error` responses - that meant `runtime.runTurn()` could treat a Claude auth failure as success (`done`) when the agent emitted JSON-RPC errors yet exited cleanly - Implemented focused upstream fix on branch `fix/subagent-wait-error-outcome`: - `extensions/acpx/src/runtime-internals/events.ts` - `toAcpxErrorEvent()` now also recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()` - `parsePromptEventLine()` now emits ACP runtime `type:"error"` events for that shape instead of dropping it - added regression coverage: - `extensions/acpx/src/runtime-internals/events.test.ts` - `extensions/acpx/src/runtime-internals/test-fixtures.ts` - `extensions/acpx/src/runtime.test.ts` - Targeted validation passed: - `cd /home/openclaw/.openclaw/workspace/external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts` - result: `22` tests passed across `3` files - Net status after this pass: - old `acpx exited with code 1/5` reports remain historical evidence only - Codex ACP direct runtime path works today - Claude ACP direct runtime path currently fails for auth, and OpenClaw had a real bug in how the bundled acpx runtime parsed that failure shape - remaining follow-up is end-to-end OpenClaw ACP-path validation once ACP is explicitly configured here (or if a fresh exit-code repro appears) - Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion. - Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat. - Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.