2026-03-13
Subagent reliability investigation
- Fresh implementation subagent launch for subagent/ACP reliability failed immediately before doing any task work.
- Failure mode: delegated run was spawned with model `glm-5`, which resolved to provider model `zai/glm-5`.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for `glm-5`.
- Verified by inspecting agent auth profile keys under: `/home/openclaw/.openclaw/agents/*/agent/auth-profiles.json`
- Relevant OpenClaw docs confirm:
  - subagent spawns inherit caller model when `sessions_spawn.model` is omitted
  - provider/model auth errors like `No API key found for provider "zai"` occur when a provider model is selected without matching auth
  - multi-agent auth is per-agent via `~/.openclaw/agents/<agentId>/agent/auth-profiles.json`
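The two documented rules above (inherit the caller's model when `sessions_spawn.model` is omitted, and fail when the selected provider has no matching auth profile) can be sketched as a pre-spawn check. This is only an illustration, not OpenClaw code: `resolveSpawnModel`, `providerOf`, and `hasAuthForModel` are hypothetical names, the `provider:name` key format is inferred from the profile keys listed above, and the caller model value is an example.

```typescript
// Hypothetical pre-spawn check; names and shapes are assumptions, not OpenClaw APIs.

/** Pick the model for a spawn: the explicit one if given, else the caller's. */
function resolveSpawnModel(callerModel: string, requested?: string): string {
  return requested ?? callerModel;
}

/** Provider prefix of a "provider/model" ref, or undefined for a bare model id. */
function providerOf(modelRef: string): string | undefined {
  const slash = modelRef.indexOf("/");
  return slash > 0 ? modelRef.slice(0, slash) : undefined;
}

/** True when some auth profile key ("provider:name") matches the model's provider. */
function hasAuthForModel(modelRef: string, profileKeys: string[]): boolean {
  const provider = providerOf(modelRef);
  if (provider === undefined) return false; // bare ids must be resolved first
  return profileKeys.some((key) => key.split(":")[0] === provider);
}

// Profile keys as listed in these notes; the caller model is an example value.
const profileKeys = ["openai-codex:default", "litellm:default", "github-copilot:github"];
const inheritedModel = resolveSpawnModel("openai-codex/gpt-5.4");
const explicitModel = resolveSpawnModel("openai-codex/gpt-5.4", "zai/glm-5");
```

Under these assumptions, `hasAuthForModel(explicitModel, profileKeys)` is false while `hasAuthForModel("litellm/glm-5", profileKeys)` is true, which matches the observation that the explicit model choice, not auth propagation, caused the failure.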
- Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
- Corrective action: retry fresh delegation with `litellm/glm-5` (the intended medium-tier routed model for delegated implementation work in this setup).
- Will explicitly requested on 2026-03-13 to use `gpt-5.4` for subagents for now while debugging delegation reliability.
- New evidence from the corrected run: `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` shows repeated assistant `stopReason:"error"` entries with `429 ... GLM-5 not included in current subscription plan`, but `~/.openclaw/subagents/runs.json` recorded run `776a8b51-6fdc-448e-83bc-55418814a05b` as `outcome.status: "ok"` and `frozenResultText: null`.
- That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result.
- Implemented upstream fix on branch `external/openclaw-upstream@fix/subagent-wait-error-outcome`:
  - added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
  - subagent registry now downgrades `agent.wait => ok` to `error` when the child session's terminal assistant message is actually an error
  - subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful `(no output)` completions
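The downgrade rule described above can be sketched roughly as follows. This is not the upstream helper: `TranscriptMessage`, `Outcome`, and `terminalOutcome` are hypothetical, and only the `stopReason` and `errorMessage` field names are taken from the transcript entries quoted in these notes.

```typescript
// Hypothetical shapes; only stopReason/errorMessage mirror fields quoted in these notes.
interface TranscriptMessage {
  role: "user" | "assistant" | "tool";
  text?: string;
  stopReason?: string;
  errorMessage?: string;
}

interface Outcome {
  status: "ok" | "error";
  text: string | null;
}

/** Downgrade a wait-level "ok" when the last assistant message is actually an error. */
function terminalOutcome(
  waitStatus: "ok" | "error",
  messages: TranscriptMessage[],
): Outcome {
  // Find the terminal assistant message without mutating the input array.
  const last = [...messages].reverse().find((m) => m.role === "assistant");
  if (last?.stopReason === "error") {
    // Empty-content errors still yield usable terminal text via errorMessage.
    const text = last.errorMessage || last.text || "unknown assistant error";
    return { status: "error", text };
  }
  return { status: waitStatus, text: last?.text ?? null };
}
```

With a transcript whose final assistant entry has `stopReason:"error"`, this sketch returns `status: "error"` and non-null text even when the wait itself reported `ok`, which is the behavior the registry and announce-flow changes aim for.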
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `50 tests` passed across `3` files
- Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`.
- Real failure-path verification later also passed on valid `gpt-5.4` by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload.
  - child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"`
  - matching `~/.openclaw/subagents/runs.json` now correctly stored:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
  - embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
  - but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
  - `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
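The dedupe mapping described above can be sketched as a small pure function. `DedupePayload` and `waitStatusFromPayload` are hypothetical names, the payload shape is reduced to the two fields mentioned in these notes, and checking the error case before the aborted case is an assumption rather than confirmed upstream behavior.

```typescript
// Hypothetical payload shape, reduced to the fields named in these notes.
interface DedupePayload {
  aborted?: boolean;
  result?: { meta?: { stopReason?: string } };
}

type WaitStatus = "ok" | "error" | "timeout";

/** Map a completed agent dedupe payload to an agent.wait-style status. */
function waitStatusFromPayload(payload: DedupePayload): WaitStatus {
  // Assumption: a terminal error takes precedence over an abort flag.
  if (payload.result?.meta?.stopReason === "error") return "error";
  if (payload.aborted === true) return "timeout";
  return "ok";
}
```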
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `81 tests` passed across `3` files
- Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does not hold on the live direct gateway path.
  - first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as transport sanity, not canonical semantics proof
  - stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch:
    - session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572`
    - run id: `gwc-live-agent-wait-gpt53-1773427893583`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}`
    - last assistant still recorded `stopReason:"error"` with `context_length_exceeded`
  - decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with:
    - `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - gateway log confirmed default model `openai-codex/gpt-5.3-codex`
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
    - same session's terminal assistant message still recorded:
      - `provider:"openai-codex"`
      - `model:"gpt-5.3-codex"`
      - `stopReason:"error"`
      - `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
- Fast source inspection after that live repro points to the most likely remaining gap:
  - `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded`
  - `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
  - `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"`
  - so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as `end` early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path matters
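The suspected gap can be illustrated with a toy simulation. This is a deliberate simplification, not the gateway code: `simulateRun` and `waitStatusFromPhases` are hypothetical, the retry grace window is ignored, and the first observed terminal phase is assumed to decide the wait status.

```typescript
type Phase = "end" | "error";

/** Returns the lifecycle phases an agent.wait-style listener would observe. */
function simulateRun(innerPhases: Phase[], resultStopReason: string): Phase[] {
  const seen: Phase[] = [];
  let lifecycleEnded = false;
  for (const phase of innerPhases) {
    seen.push(phase);
    lifecycleEnded = true; // any inner end/error marks the lifecycle as ended
  }
  // The corrective fallback only fires when no inner callback was terminal.
  if (!lifecycleEnded) {
    seen.push(resultStopReason === "error" ? "error" : "end");
  }
  return seen;
}

/** Simplification: the first terminal phase decides; "end" maps to ok. */
function waitStatusFromPhases(seen: Phase[]): "ok" | "error" {
  return seen[0] === "error" ? "error" : "ok";
}
```

In this model, `simulateRun([], "error")` yields an `error` status through the fallback, but `simulateRun(["end"], "error")` yields `ok` because the inner `end` suppresses the corrective emit, which is the mismatch seen in the live repro.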
- Net status at end of this pass:
  - subagent persistence/announcement fix: live-verified
  - raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
- Final focused live-fix pass on 2026-03-13 closed the remaining raw `agent.wait` bug.
  - root cause: the live direct gateway path could receive `agent_end` carrying a terminal assistant error without a preceding `message_end`, leaving stale/empty assistant state and still emitting lifecycle `phase:"end"`
  - final upstream fix taught embedded subscribe lifecycle handling to recover the terminal assistant from `agent_end.messages` / session transcript and emit lifecycle `phase:"error"`, and taught the gateway `agent` RPC handler to derive terminal status from observed lifecycle + final result metadata instead of blindly caching `ok`
  - final targeted validation passed:
    - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
    - result: `108 tests` passed across `5` files
  - decisive live source-gateway repro after the final fix:
    - gateway: source-run on port `18903`
    - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
    - final `agent` response returned `finalStatus:"error"`
    - matching `agent.wait` returned `status:"error"` with the same context-window error text
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (`44 tests`) but were intentionally left out-of-scope for this focused reliability pass.
ACP Claude/Codex follow-up (post-agent.wait fix)
- Historical deferred task `task-20260304-211216-acp-claude-codex` still referenced old failures `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were not reproduced in the latest focused pass.
- Current host state check:
  - `claude` installed: `/home/linuxbrew/.linuxbrew/bin/claude` (`2.1.63`)
  - `codex` installed: `/home/linuxbrew/.linuxbrew/bin/codex` (`0.107.0`)
  - no global `acpx` on PATH, but bundled plugin-local runtime exists at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` only showed `plugins.entries.telegram.enabled=true`; no explicit `acp` block / `acpx` plugin entry was present, so the smallest reliable repro path used the bundled `acpx` directly rather than a full OpenClaw ACP session
- Live direct bundled-acpx repro results:
  - Codex command:
    - `.../acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ended with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - Claude command:
    - `.../acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included top-level JSON-RPC errors:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process still exited `0`
- Source-level finding in `external/openclaw-upstream/extensions/acpx/src/runtime-internals/events.ts`:
  - prompt parsing handled typed `{type:"error"}` lines but dropped top-level JSON-RPC `error` responses
  - that meant `runtime.runTurn()` could treat a Claude auth failure as success (`done`) when the agent emitted JSON-RPC errors yet exited cleanly
- Implemented focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now also recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now emits ACP runtime `type:"error"` events for that shape instead of dropping it
  - added regression coverage:
    - `extensions/acpx/src/runtime-internals/events.test.ts`
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`
    - `extensions/acpx/src/runtime.test.ts`
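The parsing change described above amounts to treating two line shapes as errors instead of one. The sketch below is not the acpx runtime code: `AcpErrorEvent` and `parseErrorLine` are hypothetical names, and the typed-error line is assumed to carry a `message` field. The JSON-RPC shape follows the stdout lines quoted in these notes.

```typescript
// Hypothetical parser; not the actual acpx runtime implementation.
interface AcpErrorEvent {
  type: "error";
  message: string;
  code?: number;
}

/** Return an error event for either error shape, or null for any other line. */
function parseErrorLine(line: string): AcpErrorEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not JSON, so not an error event
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const obj = parsed as Record<string, unknown>;

  // Shape 1: typed error line, e.g. {"type":"error","message":"..."}
  if (obj["type"] === "error") {
    const message = obj["message"];
    return { type: "error", message: typeof message === "string" ? message : "unknown error" };
  }

  // Shape 2: top-level JSON-RPC error response, e.g.
  // {"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"..."}}
  const err = obj["error"];
  if (obj["jsonrpc"] === "2.0" && typeof err === "object" && err !== null) {
    const e = err as Record<string, unknown>;
    const message = e["message"];
    const code = e["code"];
    const event: AcpErrorEvent = {
      type: "error",
      message: typeof message === "string" ? message : "unknown error",
    };
    if (typeof code === "number") event.code = code;
    return event;
  }
  return null;
}
```

A caller that marks the turn failed whenever `parseErrorLine` returns non-null would no longer report the Claude `Authentication required` case as done, even though the process exits `0`.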
- Targeted validation passed:
  - `cd /home/openclaw/.openclaw/workspace/external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `22` tests passed across `3` files
- Net status after this pass:
  - old `acpx exited with code 1/5` reports remain historical evidence only
  - Codex ACP direct runtime path works today
  - Claude ACP direct runtime path currently fails for auth, and OpenClaw had a real bug in how the bundled acpx runtime parsed that failure shape
  - remaining follow-up is end-to-end OpenClaw ACP-path validation once ACP is explicitly configured here (or if a fresh exit-code repro appears)
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
- Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.