Files
swarm-zap/memory/2026-03-13.md

14 KiB

2026-03-13

Subagent reliability investigation

  • Fresh implementation subagent launch for subagent/ACP reliability failed immediately before doing any task work.
  • Failure mode: delegated run was spawned with model glm-5, which resolved to provider model zai/glm-5.
  • Current installed agent auth profile keys inspected in agent stores include openai-codex:default, litellm:default, and github-copilot:github.
  • Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for glm-5.
  • Verified by inspecting agent auth profile keys under:
    • /home/openclaw/.openclaw/agents/*/agent/auth-profiles.json
  • Relevant OpenClaw docs confirm:
    • subagent spawns inherit caller model when sessions_spawn.model is omitted
    • provider/model auth errors like No API key found for provider "zai" occur when a provider model is selected without matching auth
    • multi-agent auth is per-agent via ~/.openclaw/agents/<agentId>/agent/auth-profiles.json
  • Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
  • Corrective action: retry fresh delegation with litellm/glm-5 (the intended medium-tier routed model for delegated implementation work in this setup).
  • Will explicitly requested on 2026-03-13 to use gpt-5.4 for subagents for now while debugging delegation reliability.
  • New evidence from the corrected run: ~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl shows repeated assistant stopReason:"error" entries with 429 ... GLM-5 not included in current subscription plan, but ~/.openclaw/subagents/runs.json recorded run 776a8b51-6fdc-448e-83bc-55418814a05b as outcome.status: "ok" and frozenResultText: null.
  • That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result.
  • Implemented upstream fix on branch external/openclaw-upstream@fix/subagent-wait-error-outcome:
    • added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
    • subagent registry now downgrades agent.wait => ok to error when the child session's terminal assistant message is actually an error
    • subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful (no output) completions
  • Targeted validation passed:
    • pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: 50 tests passed across 3 files
  • Real success-path verification later passed on gpt-5.4 with run 23750d80-b481-4f50-b219-cc9245be405f and final child result SUCCESS-PROBE-OK.
  • Real failure-path verification later also passed on valid gpt-5.4 by intentionally triggering a context_length_exceeded provider error with a token-dense oversized task payload.
    • child run: b50cb91f-6219-44f7-9d2f-a1264ac7ceaf
    • child session: agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594
    • child transcript: ~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl
    • transcript terminal assistant entry recorded provider:"openai-codex", model:"gpt-5.4", stopReason:"error", errorMessage:"Codex error: {...context_length_exceeded...}"
    • matching ~/.openclaw/subagents/runs.json now correctly stored:
      • outcome.status: "error"
      • outcome.error: "Codex error: {...context_length_exceeded...}"
      • endedReason: "subagent-error"
      • frozenResultText: "Codex error: {...context_length_exceeded...}"
  • Important remaining nuance from the live repro: raw gateway agent.wait for that same failed child returned status:"ok" with only endedAt even though the child transcript terminal assistant message had stopReason:"error".
  • Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional agent.wait layering choice:
    • embedded subscribe lifecycle already emits phase:"error" for terminal assistant/provider failures
    • but src/commands/agent.ts had a fallback lifecycle emitter that still sent phase:"end" whenever no inner lifecycle callback was observed, even if the resolved run result carried meta.stopReason:"error"
    • waitForAgentJob gives lifecycle errors a retry grace window, so that fallback end could overwrite the terminal failure and make agent.wait resolve ok
  • Implemented focused upstream follow-up on branch fix/subagent-wait-error-outcome:
    • src/commands/agent.ts now emits lifecycle phase:"error" with extracted terminal error text when a resolved run stops with meta.stopReason:"error" and no inner lifecycle callback fired
    • src/gateway/server-methods/agent-wait-dedupe.ts now also maps completed agent dedupe payloads with result.meta.stopReason:"error" to status:"error" and aborted:true to status:"timeout"
  • Targeted validation passed:
    • pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: 81 tests passed across 3 files
  • Live runtime verification was re-run later on 2026-03-13 and showed the current agent.wait follow-up fix still does not hold on the live direct gateway path.
    • first temp-gateway sanity run via GatewayClient against loopback port 18901 on a persisted gpt-5.4 session returned status:"error", but only because that temp runtime reported FailoverError: Unknown model: openai-codex/gpt-5.4; useful as transport sanity, not canonical semantics proof
    • stale-dist temp gateway repro on default model (gpt-5.3-codex) already showed the mismatch:
      • session key: agent:main:subagent:agent-wait-gpt53-live-1773427893572
      • run id: gwc-live-agent-wait-gpt53-1773427893583
      • agent.wait: {"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}
      • last assistant still recorded stopReason:"error" with context_length_exceeded
    • decisive live source-gateway repro used a fresh source-run gateway on port 18902 launched with:
      • OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured
      • gateway log confirmed default model openai-codex/gpt-5.3-codex
      • session key: agent:main:subagent:agent-wait-gpt53-live-source-1773427981586
      • run id: gwc-live-agent-wait-gpt53-source-1773427981614
      • payload chars: 880150
      • start: {"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}
      • agent.wait: {"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}
      • same session's terminal assistant message still recorded:
        • provider:"openai-codex"
        • model:"gpt-5.3-codex"
        • stopReason:"error"
        • errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"
  • Fast source inspection after that live repro points to the most likely remaining gap:
    • src/commands/agent.ts only emits the new corrective lifecycle phase:"error" when !lifecycleEnded
    • lifecycleEnded becomes true as soon as any inner lifecycle callback reports phase:"end" or phase:"error"
    • src/gateway/server-methods/agent-job.ts still treats lifecycle phase:"end" as terminal status:"ok"
    • so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as end early enough that agent.wait resolves ok before the dedupe/result-meta rescue path matters
  • Net status at end of this pass:
    • subagent persistence/announcement fix: live-verified
    • raw agent.wait follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
  • Final focused live-fix pass on 2026-03-13 closed the remaining raw agent.wait bug.
    • root cause: the live direct gateway path could receive agent_end carrying a terminal assistant error without a preceding message_end, leaving stale/empty assistant state and still emitting lifecycle phase:"end"
    • final upstream fix taught embedded subscribe lifecycle handling to recover the terminal assistant from agent_end.messages / session transcript and emit lifecycle phase:"error", and taught the gateway agent RPC handler to derive terminal status from observed lifecycle + final result metadata instead of blindly caching ok
    • final targeted validation passed:
      • pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts
      • result: 108 tests passed across 5 files
    • decisive live source-gateway repro after the final fix:
      • gateway: source-run on port 18903
      • run id: gwc-live-agent-wait-gpt53-source-fixed2-1773429512008
      • final agent response returned finalStatus:"error"
      • matching agent.wait returned status:"error" with the same context-window error text
  • Net status now:
    • subagent persistence/announcement fix: live-verified
    • raw agent.wait semantics fix: live-verified
  • Side note: unrelated dirty /subagents log UX changes in external/openclaw-upstream regression-passed src/auto-reply/reply/commands.test.ts (44 tests) but were intentionally left out-of-scope for this focused reliability pass.

ACP Claude/Codex follow-up (post-agent.wait fix)

  • Historical deferred task task-20260304-211216-acp-claude-codex still referenced old failures Claude: acpx exited with code 1 and Codex: acpx exited with code 5, but those exact crashes were not reproduced in the latest focused pass.
  • Current host state check:
    • claude installed: /home/linuxbrew/.linuxbrew/bin/claude (2.1.63)
    • codex installed: /home/linuxbrew/.linuxbrew/bin/codex (0.107.0)
    • no global acpx on PATH, but bundled plugin-local runtime exists at ~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx
    • current ~/.openclaw/openclaw.json only showed plugins.entries.telegram.enabled=true; no explicit acp block / acpx plugin entry was present, so the smallest reliable repro path used the bundled acpx directly rather than a full OpenClaw ACP session
  • Live direct bundled-acpx repro results:
    • Codex command:
      • .../acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'
      • result: clean JSON-RPC/session stream ended with agent_message_chunk: "OK", id:2 result:{stopReason:"end_turn"}, process exit=0
    • Claude command:
      • .../acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'
      • stdout included top-level JSON-RPC errors:
        • {"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}
        • {"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}
      • process still exited 0
  • Source-level finding in external/openclaw-upstream/extensions/acpx/src/runtime-internals/events.ts:
    • prompt parsing handled typed {type:"error"} lines but dropped top-level JSON-RPC error responses
    • that meant runtime.runTurn() could treat a Claude auth failure as success (done) when the agent emitted JSON-RPC errors yet exited cleanly
  • Implemented focused upstream fix on branch fix/subagent-wait-error-outcome:
    • extensions/acpx/src/runtime-internals/events.ts
      • toAcpxErrorEvent() now also recognizes top-level JSON-RPC error responses via parseControlJsonError()
      • parsePromptEventLine() now emits ACP runtime type:"error" events for that shape instead of dropping it
    • added regression coverage:
      • extensions/acpx/src/runtime-internals/events.test.ts
      • extensions/acpx/src/runtime-internals/test-fixtures.ts
      • extensions/acpx/src/runtime.test.ts
  • Targeted validation passed:
    • cd /home/openclaw/.openclaw/workspace/external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts
    • result: 22 tests passed across 3 files
  • Net status after this pass:
    • old acpx exited with code 1/5 reports remain historical evidence only
    • Codex ACP direct runtime path works today
    • Claude ACP direct runtime path currently fails for auth, and OpenClaw had a real bug in how the bundled acpx runtime parsed that failure shape
    • remaining follow-up is end-to-end OpenClaw ACP-path validation once ACP is explicitly configured here (or if a fresh exit-code repro appears)
  • Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
  • Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
  • Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.