swarm-zap/memory/2026-03-13.md

# 2026-03-13

## Subagent reliability investigation
- Fresh implementation subagent launch for subagent/ACP reliability failed immediately before doing any task work.
- Failure mode: delegated run was spawned with model `glm-5`, which resolved to provider model `zai/glm-5`.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for `glm-5`.
- Verified by inspecting agent auth profile keys under:
  - `/home/openclaw/.openclaw/agents/*/agent/auth-profiles.json`
- Relevant OpenClaw docs confirm:
  - subagent spawns inherit caller model when `sessions_spawn.model` is omitted
  - provider/model auth errors like `No API key found for provider "zai"` occur when a provider model is selected without matching auth
  - multi-agent auth is per-agent via `~/.openclaw/agents/<agentId>/agent/auth-profiles.json`
- Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
- Corrective action: retry fresh delegation with `litellm/glm-5` (the intended medium-tier routed model for delegated implementation work in this setup).
- Will explicitly requested on 2026-03-13 to use `gpt-5.4` for subagents for now while debugging delegation reliability.
- New evidence from the corrected run: `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` shows repeated assistant `stopReason:"error"` entries with `429 ... GLM-5 not included in current subscription plan`, but `~/.openclaw/subagents/runs.json` recorded run `776a8b51-6fdc-448e-83bc-55418814a05b` as `outcome.status: "ok"` and `frozenResultText: null`.
- That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result.
- Implemented upstream fix on branch `external/openclaw-upstream@fix/subagent-wait-error-outcome`:
  - added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
  - subagent registry now downgrades `agent.wait => ok` to `error` when the child session's terminal assistant message is actually an error
  - subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful `(no output)` completions
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `50 tests` passed across `3` files
- Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`.
- Real failure-path verification later also passed on valid `gpt-5.4` by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload.
  - child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"`
  - matching `~/.openclaw/subagents/runs.json` now correctly stored:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
  - embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
  - but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
  - `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `81 tests` passed across `3` files
- Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does **not** hold on the live direct gateway path.
  - first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as transport sanity, not canonical semantics proof
  - stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch:
    - session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572`
    - run id: `gwc-live-agent-wait-gpt53-1773427893583`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}`
    - last assistant still recorded `stopReason:"error"` with `context_length_exceeded`
  - decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with:
    - `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - gateway log confirmed default model `openai-codex/gpt-5.3-codex`
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
    - same session's terminal assistant message still recorded:
      - `provider:"openai-codex"`
      - `model:"gpt-5.3-codex"`
      - `stopReason:"error"`
      - `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
- Fast source inspection after that live repro points to the most likely remaining gap:
  - `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded`
  - `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
  - `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"`
  - so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as `end` early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path matters
- Net status at end of this pass:
  - subagent persistence/announcement fix: live-verified
  - raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
- Final focused live-fix pass on 2026-03-13 closed the remaining raw `agent.wait` bug.
  - root cause: the live direct gateway path could receive `agent_end` carrying a terminal assistant error without a preceding `message_end`, leaving stale/empty assistant state and still emitting lifecycle `phase:"end"`
  - final upstream fix taught embedded subscribe lifecycle handling to recover the terminal assistant from `agent_end.messages` / session transcript and emit lifecycle `phase:"error"`, and taught the gateway `agent` RPC handler to derive terminal status from observed lifecycle + final result metadata instead of blindly caching `ok`
  - final targeted validation passed:
    - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
    - result: `108 tests` passed across `5` files
  - decisive live source-gateway repro after the final fix:
    - gateway: source-run on port `18903`
    - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
    - final `agent` response returned `finalStatus:"error"`
    - matching `agent.wait` returned `status:"error"` with the same context-window error text
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.

## ACP Claude/Codex follow-up (post-`agent.wait` fix)
- Historical deferred task `task-20260304-211216-acp-claude-codex` still referenced old failures `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were **not** reproduced in the latest focused pass.
- Current host state check:
  - `claude` installed: `/home/linuxbrew/.linuxbrew/bin/claude` (`2.1.63`)
  - `codex` installed: `/home/linuxbrew/.linuxbrew/bin/codex` (`0.107.0`)
  - no global `acpx` on PATH, but bundled plugin-local runtime exists at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` only showed `plugins.entries.telegram.enabled=true`; no explicit `acp` block / `acpx` plugin entry was present, so the smallest reliable repro path used the bundled `acpx` directly rather than a full OpenClaw ACP session
- Live direct bundled-acpx repro results:
  - Codex command:
    - `.../acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ended with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - Claude command:
    - `.../acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included top-level JSON-RPC errors:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process still exited `0`
- Source-level finding in `external/openclaw-upstream/extensions/acpx/src/runtime-internals/events.ts`:
  - prompt parsing handled typed `{type:"error"}` lines but dropped top-level JSON-RPC `error` responses
  - that meant `runtime.runTurn()` could treat a Claude auth failure as success (`done`) when the agent emitted JSON-RPC errors yet exited cleanly
- Implemented focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now also recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now emits ACP runtime `type:"error"` events for that shape instead of dropping it
  - added regression coverage:
    - `extensions/acpx/src/runtime-internals/events.test.ts`
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`
    - `extensions/acpx/src/runtime.test.ts`
- Targeted validation passed:
  - `cd /home/openclaw/.openclaw/workspace/external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `22` tests passed across `3` files
- Net status after this pass:
  - old `acpx exited with code 1/5` reports remain historical evidence only
  - Codex ACP direct runtime path works today
  - Claude ACP direct runtime path currently fails for auth, and OpenClaw had a real bug in how the bundled acpx runtime parsed that failure shape
  - remaining follow-up is end-to-end OpenClaw ACP-path validation once ACP is explicitly configured here (or if a fresh exit-code repro appears)
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
- Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.