WIP.subagent-reliability.md
Status
Status: follow-up
Owner: zap
Opened: 2026-03-13
Last updated: 2026-03-13
Purpose
Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.
Current state
- The core reliability thread tracked in this WIP is now fixed and live-verified on `external/openclaw-upstream` branch `fix/subagent-wait-error-outcome`.
- Verified fixed:
  - subagent persistence / announcement handling for terminal assistant-provider failures
  - raw `agent.wait` semantics for the live direct gateway path
- Key upstream commits on this branch:
  - `2a2ed0d6f`: `fix(subagents): derive outcome from terminal assistant errors`
  - `5a328d22b`: `fix(agent): surface terminal run errors in wait semantics`
  - `f9a78e8f7`: `fix(gateway): honor terminal assistant errors in live wait path`
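As a rough illustration of what "derive outcome from terminal assistant errors" means, the sketch below inspects the last assistant message of a child transcript and lets its `stopReason` decide the persisted outcome. The type names, the `text` field, and the `deriveOutcome` helper are hypothetical; only `stopReason`, `errorMessage`, `outcome.status`, and `frozenResultText` come from the transcript and `runs.json` excerpts recorded in this note, and the real upstream code may be structured differently.

```typescript
// Hypothetical shapes: field names mirror the transcript and runs.json
// excerpts quoted in this note; the real upstream types may differ.
interface TranscriptMessage {
  role: "assistant" | "user" | "tool";
  stopReason?: string;
  errorMessage?: string;
  text?: string; // assumed field holding the final assistant text
}

interface SubagentOutcome {
  status: "ok" | "error";
  error?: string;
  frozenResultText: string | null;
}

// Let the last assistant message in the transcript decide the outcome.
function deriveOutcome(transcript: TranscriptMessage[]): SubagentOutcome {
  const terminal = [...transcript].reverse().find((m) => m.role === "assistant");
  if (terminal === undefined) {
    // Assumption: with no assistant message there is nothing to inspect.
    return { status: "ok", frozenResultText: null };
  }
  if (terminal.stopReason === "error") {
    // A terminal provider error is stored as an error, with the error text
    // frozen as the result, instead of status "ok" with a null result.
    const error = terminal.errorMessage ?? "unknown assistant error";
    return { status: "error", error, frozenResultText: error };
  }
  return { status: "ok", frozenResultText: terminal.text ?? null };
}
```

With this shape, a transcript whose last assistant entry has `stopReason: "error"` can no longer be recorded as `outcome.status: "ok"` with `frozenResultText: null`, which is the mismatch described under Evidence.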
Why this file is still open
- The broader delegation reliability task is not fully done yet.
- Remaining follow-up work is now narrower:
  - ACP-specific Claude/Codex runtime failures / final live OpenClaw ACP validation
  - optional separate `/subagents log` UX cleanup
  - push/PR the focused upstream reliability branch when desired
Related tasks
- `task-20260304-2215-subagent-reliability`: in progress
- `task-20260304-211216-acp-claude-codex`: open
Known context
- Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
- Upstream patch exists in `external/openclaw-upstream` on branch `fix/tui-hide-internal-runtime-context`, commit `0f66a4547`.
- User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
- Fresh-session implementation discipline and monitoring thresholds were already documented locally.
Immediate baton
- Do not reopen the solved `agent.wait` investigation unless a fresh repro appears.
- If this project is resumed next, start with real OpenClaw ACP-path validation of the new acpx JSON-RPC error handling (or capture a fresh Claude/Codex end-to-end repro if ACP still is not configured here).
- Treat the historical `acpx exited with code 1/5` note as unresolved-but-unreproduced; do not spend more time on it without fresh evidence.
- Treat `/subagents log` UX edits as a separate branch/pass so they do not muddy the reliability fix branch.
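For whoever picks up the baton, the solved `agent.wait` behavior can be summarized as a small status-mapping sketch. It is an illustration under assumptions, not the upstream implementation: `meta.stopReason` and `aborted` are named in the evidence below, but where `aborted` lives on the payload, the type names, and the helpers `fallbackPhase` and `resolveWaitStatus` are hypothetical.

```typescript
// Hypothetical types: `meta.stopReason` and `aborted` are named in this
// note; their exact location on the payload is an assumption.
type WaitStatus = "ok" | "error" | "timeout";
type LifecyclePhase = "end" | "error";

interface ResolvedRun {
  meta?: { stopReason?: string; aborted?: boolean };
}

// Phase to emit when no inner lifecycle callback fired for the run.
// Before the fix this fallback was unconditionally "end".
function fallbackPhase(run: ResolvedRun): LifecyclePhase {
  return run.meta?.stopReason === "error" ? "error" : "end";
}

// Combine observed lifecycle phases with the resolved run metadata.
// An observed or reported error is not downgraded by a later "end".
function resolveWaitStatus(observed: LifecyclePhase[], run: ResolvedRun): WaitStatus {
  if (run.meta?.aborted === true) {
    return "timeout";
  }
  const phases = observed.length > 0 ? observed : [fallbackPhase(run)];
  const sawError = phases.some((phase) => phase === "error");
  if (sawError || run.meta?.stopReason === "error") {
    return "error";
  }
  return "ok";
}
```

The design point is that the terminal status is derived from both sources of truth, so a late fallback `end` cannot turn a terminal assistant/provider error into `status:"ok"`.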
Evidence gathered so far
- Fresh subagent run failed immediately when an explicit `glm-5` choice resolved into the Z.AI provider path before any useful task execution.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified that Z.AI auth does exist, but this account is not entitled for `glm-5`.
- Root cause for this immediate repro is therefore best described as a provider/model entitlement mismatch caused by the explicit spawn model choice, not missing auth propagation between agents.
- A later "corrected" run using `litellm/glm-5` also did not succeed: child transcript `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` contains repeated assistant `stopReason:"error"` entries with `429 ... subscription plan does not yet include access to GLM-5`, while `~/.openclaw/subagents/runs.json` recorded that run (`776a8b51-6fdc-448e-83bc-55418814a05b`) as `outcome.status: "ok"` with `frozenResultText: null`.
- This separates the problems:
  - ACP/operator/model-selection issue: explicit `glm-5` → `zai/glm-5` without a `glm-5` entitlement (already understood).
  - Generic subagent completion/reporting issue: terminal assistant errors can still be stored/announced as successful completion with no frozen result.
- Implemented upstream patch on branch `fix/subagent-wait-error-outcome` in `external/openclaw-upstream` so subagent completion paths inspect the latest assistant terminal message and treat terminal assistant errors as `outcome.status: "error"` rather than `ok`.
- Validation completed for targeted non-E2E coverage:
  - `pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`50 tests` across `3` files).
- E2E-style `subagent-announce.format.e2e.test.ts` coverage was updated, but the normal Vitest include rules exclude `*.e2e.test.ts`; direct `pnpm test -- --run ...e2e...` confirms exclusion rather than executing that file.
- Tried to take over live verification directly in the main session on 2026-03-13:
  - confirmed upstream branch `fix/subagent-wait-error-outcome` is present with commit `2a2ed0d6f`
  - confirmed normal packaged gateway was healthy before attempting runtime verification
  - first direct hot-swap attempt was interrupted at gateway stop time; systemd restored the packaged gateway cleanly
  - no patched upstream gateway was left running after that attempt
- Current state: upstream patch + targeted tests are real.
- Real subagent success verification now completed on `gpt-5.4`:
  - run id: `23750d80-b481-4f50-b219-cc9245be405f`
  - child session: `agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509`
  - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
- A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
  - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
- Live failure-path verification on a valid working model/runtime is now complete on `gpt-5.4`.
  - spawned child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - requester session: `agent:main:subagent-reliability-failure-hex-1773425126098`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - terminal child assistant message (transcript line 6) recorded:
    - `provider: "openai-codex"`
    - `model: "gpt-5.4"`
    - `stopReason: "error"`
    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - matching `~/.openclaw/subagents/runs.json` record now correctly persisted:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced subagent outcomes, but not for the lower-level `agent.wait` RPC semantics.
- Follow-up code inspection on 2026-03-13 found that the `agent.wait` mismatch is a real upstream bug, not intentional layering:
  - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` already treats terminal assistant `stopReason:"error"` as lifecycle `phase:"error"`.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also interprets resolved agent RPC payloads with `result.meta.stopReason:"error"` as terminal `status:"error"` (and `aborted:true` as `timeout`).
  - but `src/commands/agent.ts` still had a fallback path that unconditionally emitted lifecycle `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`.
  - because `waitForAgentJob` gives lifecycle errors a retry grace window, that fallback `end` could overwrite the earlier failed state and make raw `agent.wait` resolve `status:"ok"` for a terminal assistant/provider error.
- Implemented the smallest focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` (with extracted terminal error text) when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired.
  - `src/commands/agent.test.ts` adds coverage for that fallback path.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` + `agent-wait-dedupe.test.ts` cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to `error`/`timeout` instead of staying `ok`.
- Targeted validation for this follow-up passed:
  - `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`81 tests` across `3` files).
- Follow-up live runtime verification on 2026-03-13 showed the current `agent.wait` fix did not close the live path yet.
  - patched gateway launched directly from source on loopback with channels skipped:
    - command: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - log evidence: `2026-03-13T18:52:10.743+00:00 [gateway] agent model: openai-codex/gpt-5.3-codex`
  - live repro used a fresh default-model session and an oversized in-memory payload over `GatewayClient` (not CLI argv):
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - wait result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
    - same session's terminal assistant message still recorded a real provider failure:
      - `provider: "openai-codex"`
      - `model: "gpt-5.3-codex"`
      - `stopReason: "error"`
      - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - earlier temporary gateway runs reinforced the same mismatch:
    - stale dist gateway repro run `gwc-live-agent-wait-gpt53-1773427893583` also returned `status:"ok"` while transcript stopReason remained `error`
    - temp `gpt-5.4` session repro on the same temp gateway returned `status:"error"`, but only because that runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; that is useful as transport sanity, but not the canonical live semantics proof
- The final focused live-fix pass on 2026-03-13 closed the remaining `agent.wait` bug.
  - root cause confirmed: the live direct gateway path could receive an inner `agent_end` event carrying a terminal assistant error without a preceding `message_end`, which left stale/empty assistant state and still emitted lifecycle `phase:"end"`
  - upstream fix extends the embedded subscribe lifecycle handler to recover the terminal assistant from `agent_end.messages` or the session transcript when state is stale, then emit lifecycle `phase:"error"` with a friendly error string instead of `end`
  - upstream fix also updates the direct gateway `agent` RPC handler to observe lifecycle events for the run and derive the final RPC payload/terminal status from observed lifecycle + resolved result metadata, instead of blindly caching `status:"ok"` when the outer RPC resolves
  - files changed for the final fix:
    - `src/agents/pi-embedded-subscribe.e2e-harness.ts`
    - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts`
    - `src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts`
    - `src/agents/pi-embedded-subscribe.handlers.ts`
    - `src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts`
    - `src/gateway/server-methods/agent.ts`
    - `src/gateway/server-methods/server-methods.test.ts`
- Final targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `108 tests` passed across `5` files
- Final decisive live source-gateway repro after the fix:
  - gateway launch: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured`
  - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
  - session key: `agent:main:subagent:agent-wait-gpt53-live-source-fixed2-1773429512008`
  - final `agent` response with `expectFinal: true` returned:
    - `finalStatus: "error"`
    - `finalSummary: "LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."`
  - matching `agent.wait` returned:
    - `{"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}`
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (44 tests), but it is still out-of-scope for this focused reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
- ACP follow-up pass on 2026-03-13 found a new live-reproducible runtime bug in the bundled `extensions/acpx` layer:
  - current host state does not expose a global `acpx` binary on PATH, but the bundled plugin-local runtime exists and works at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` does not contain an explicit `acp` block or enabled `acpx` plugin entry, so this pass used the smallest direct runtime repro path instead of a full `sessions_spawn(runtime:"acp")` OpenClaw run
  - live direct Codex repro now succeeds:
    - command: bundled `acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ending with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - live direct Claude repro does not crash, but returns top-level JSON-RPC auth errors and still exits 0:
    - command: bundled `acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process `exit=0`
  - source inspection showed `extensions/acpx/src/runtime-internals/events.ts` ignored that top-level JSON-RPC error shape during prompt streaming, so `runtime.runTurn()` could silently treat Claude auth failure as success (`done`) when no typed `error` event or non-zero exit was emitted
- Implemented the smallest focused upstream runtime fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now maps those JSON-RPC errors into ACP runtime `type:"error"` events instead of dropping them
  - regression coverage added:
    - `extensions/acpx/src/runtime-internals/events.test.ts`: top-level JSON-RPC prompt error parsing
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`: mock prompt path for clean-exit JSON-RPC auth error
    - `extensions/acpx/src/runtime.test.ts`: `runTurn()` emits error and does not emit `done` for the Claude-style auth failure shape
- Targeted validation for the ACP follow-up fix passed:
  - `cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `3` files passed, `22` tests passed
- Current interpretation of the old Claude/Codex ACP bug after this pass:
  - historical notes still say `Claude: acpx exited with code 1`, `Codex: acpx exited with code 5`
  - those exact exit-code crashes were not reproduced today
  - current live state is narrower and better understood:
    - Codex ACP path works directly
    - Claude ACP path currently fails for auth, and OpenClaw previously mishandled that failure shape in the acpx runtime layer
- Remaining open ACP follow-up after this fix:
  - validate the patched runtime through the real OpenClaw ACP path (`sessions_spawn(runtime:"acp")`) once ACP is explicitly enabled/configured here, or whenever a fresh end-to-end repro is available
  - only reopen the historical `acpx exited with code 1/5` line if a fresh repro appears
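The acpx failure shape recorded above (top-level JSON-RPC `error` responses on stdout with process `exit=0`) can be sketched as a line classifier. This is a minimal illustration, not the upstream `parsePromptEventLine()` / `toAcpxErrorEvent()` code: the `PromptEvent` type and the helpers `classifyPromptLine` and `turnSucceeded` are hypothetical names, and the success line used in the usage note is a reconstructed example rather than captured output.

```typescript
// Hypothetical event type; this note only states that JSON-RPC errors
// should become runtime events with type "error".
type PromptEvent =
  | { type: "error"; code: number; message: string }
  | { type: "other"; raw: unknown };

// Classify one stdout line. Returns null for lines that are not JSON objects.
function classifyPromptLine(line: string): PromptEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) {
    return null;
  }
  const record = parsed as Record<string, unknown>;
  const err = record["error"];
  if (record["jsonrpc"] === "2.0" && typeof err === "object" && err !== null) {
    // Top-level JSON-RPC error response: surface it instead of dropping it.
    const fields = err as Record<string, unknown>;
    const code = fields["code"];
    const message = fields["message"];
    return {
      type: "error",
      code: typeof code === "number" ? code : 0,
      message: typeof message === "string" ? message : "unknown JSON-RPC error",
    };
  }
  return { type: "other", raw: parsed };
}

// A turn counts as successful only if no line classified as an error,
// regardless of the process exit code.
function turnSucceeded(stdoutLines: string[]): boolean {
  return !stdoutLines.some((line) => classifyPromptLine(line)?.type === "error");
}
```

Feeding the two `Authentication required` lines from the Claude repro into `turnSucceeded` yields `false` even though the process exited 0, while a stream that only contains result lines such as `{"jsonrpc":"2.0","id":2,"result":{"stopReason":"end_turn"}}` yields `true`.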
Constraints
- Prefer evidence over theory.
- Do not claim a fix without concrete validation.
- Keep the main session clean; use this file as the canonical baton.
Success criteria
- Clear diagnosis of the current reliability problem(s).
- At least one of:
- implemented fix with validation, or
- sharply scoped next fix plan with exact evidence and files.
- `memory/2026-03-13.md` (or current daily note), `memory/tasks.json`, and this WIP updated.