WIP.subagent-reliability.md

Status

  • Status: follow-up
  • Owner: zap
  • Opened: 2026-03-13
  • Last updated: 2026-03-13

Purpose

Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

Current state

  • The core reliability thread tracked in this WIP is now fixed and live-verified on external/openclaw-upstream branch fix/subagent-wait-error-outcome.
  • Verified fixed:
    • subagent persistence / announcement handling for terminal assistant-provider failures
    • raw agent.wait semantics for the live direct gateway path
  • Key upstream commits on this branch:
    • 2a2ed0d6f fix(subagents): derive outcome from terminal assistant errors
    • 5a328d22b fix(agent): surface terminal run errors in wait semantics
    • f9a78e8f7 fix(gateway): honor terminal assistant errors in live wait path
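
The first commit listed above derives a subagent's outcome from its terminal assistant message. The idea can be sketched roughly as follows. This is a minimal illustration, not the upstream code: the types and the deriveOutcome helper are hypothetical, and the "subagent-complete" value is an assumed placeholder. Only the field names (stopReason, errorMessage, outcome.status, endedReason, frozenResultText) and the "subagent-error" value come from the evidence recorded in this file.

```typescript
// Hypothetical shapes for illustration only; field names mirror the
// transcript and runs.json fields quoted in this note.
type TranscriptMessage = {
  role: "user" | "assistant" | "tool";
  text?: string;
  stopReason?: string;
  errorMessage?: string;
};

type RunRecord = {
  outcome: { status: "ok" | "error"; error?: string };
  endedReason: string;
  frozenResultText: string | null;
};

// Hypothetical helper: inspect the latest assistant message and treat a
// terminal stopReason of "error" as a failed run instead of "ok".
function deriveOutcome(messages: TranscriptMessage[]): RunRecord {
  let last: TranscriptMessage | undefined;
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m && m.role === "assistant") {
      last = m;
      break;
    }
  }

  if (last && last.stopReason === "error") {
    const error = last.errorMessage ?? "unknown assistant error";
    return {
      outcome: { status: "error", error },
      endedReason: "subagent-error",
      frozenResultText: error,
    };
  }

  return {
    outcome: { status: "ok" },
    // "subagent-complete" is an assumed placeholder value, not taken from upstream.
    endedReason: "subagent-complete",
    frozenResultText: last?.text ?? null,
  };
}
```

With this shape, a transcript whose last assistant entry carries stopReason "error" can no longer be persisted as outcome.status "ok" with a null frozen result, which is the mismatch recorded under Evidence gathered so far.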

Why this file is still open

  • The broader delegation reliability task is not fully done yet.
  • Remaining follow-up work is now narrower:
    1. ACP-specific Claude/Codex runtime failures / final live OpenClaw ACP validation
    2. optional separate /subagents log UX cleanup
    3. push/PR the focused upstream reliability branch when desired
  • task-20260304-2215-subagent-reliability — in progress
  • task-20260304-211216-acp-claude-codex — open

Known context

  • Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
  • Upstream patch exists in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547.
  • User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
  • Fresh-session implementation discipline and monitoring thresholds were already documented locally.

Immediate baton

  • Do not reopen the solved agent.wait investigation unless a fresh repro appears.
  • If this project is resumed next, start with real OpenClaw ACP-path validation of the new acpx JSON-RPC error handling (or capture a fresh Claude/Codex end-to-end repro if ACP is still not configured here).
  • Treat the historical acpx exited with code 1/5 note as unresolved-but-unreproduced; do not spend more time on it without fresh evidence.
  • Treat /subagents log UX edits as a separate branch/pass so they do not muddy the reliability fix branch.
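
For the ACP-path validation, the failure shape to look for is a top-level JSON-RPC error response emitted by a process that still exits 0. The sketch below shows one way to detect that shape in a single stdout line. It is an illustration under stated assumptions: detectTopLevelJsonRpcError and AcpErrorEvent are hypothetical names, and this is not the upstream events.ts code.

```typescript
// Hypothetical event shape for illustration; not the upstream ACP runtime type.
type AcpErrorEvent = { type: "error"; code: number; message: string };

// Hypothetical helper: return an error event when a stdout line is a
// top-level JSON-RPC 2.0 error response, otherwise null.
function detectTopLevelJsonRpcError(line: string): AcpErrorEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not JSON, so not a JSON-RPC response
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const response = parsed as Record<string, unknown>;
  if (response["jsonrpc"] !== "2.0") return null;

  const err = response["error"];
  if (typeof err !== "object" || err === null) return null;

  const errFields = err as Record<string, unknown>;
  const code = errFields["code"];
  const message = errFields["message"];
  if (typeof code !== "number" || typeof message !== "string") return null;

  // The response id may be a number or null; both count as a terminal error here.
  return { type: "error", code, message };
}
```

When validating, a line such as {"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}} should surface as an error event even though the process exit code is 0.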

Evidence gathered so far

  • Fresh subagent run failed immediately when an explicit glm-5 choice resolved into the Z.AI provider path before any useful task execution.
  • Current installed agent auth profile keys inspected in agent stores include openai-codex:default, litellm:default, and github-copilot:github.
  • Will clarified that Z.AI auth does exist, but this account is not entitled for glm-5.
  • Root cause for this immediate repro is therefore best described as a provider/model entitlement mismatch caused by the explicit spawn model choice, not missing auth propagation between agents.
  • A later "corrected" run using litellm/glm-5 also did not succeed: child transcript ~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl contains repeated assistant stopReason:"error" entries with 429 ... subscription plan does not yet include access to GLM-5, while ~/.openclaw/subagents/runs.json recorded that run (776a8b51-6fdc-448e-83bc-55418814a05b) as outcome.status: "ok" with frozenResultText: null.
  • This separates the problems:
    • ACP/operator/model-selection issue: explicit glm-5 resolving to zai/glm-5 without entitlement (already understood).
    • Generic subagent completion/reporting issue: terminal assistant errors can still be stored/announced as successful completion with no frozen result.
  • Implemented upstream patch on branch fix/subagent-wait-error-outcome in external/openclaw-upstream so subagent completion paths inspect the latest assistant terminal message and treat terminal assistant errors as outcome.status: "error" rather than ok.
  • Validation completed for targeted non-E2E coverage:
    • pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: passed (50 tests across 3 files).
  • E2E-style subagent-announce.format.e2e.test.ts coverage was updated but the normal Vitest include rules exclude *.e2e.test.ts; direct pnpm test -- --run ...e2e... confirms exclusion rather than executing that file.
  • Tried to take over live verification directly in the main session on 2026-03-13:
    • confirmed upstream branch fix/subagent-wait-error-outcome is present with commit 2a2ed0d6f
    • confirmed normal packaged gateway was healthy before attempting runtime verification
    • first direct hot-swap attempt was interrupted at gateway stop time; systemd restored the packaged gateway cleanly
    • no patched upstream gateway was left running after that attempt
  • State after that attempt: the upstream patch and targeted tests were in place, but live runtime verification was still outstanding.
  • Real subagent success verification now completed on gpt-5.4:
    • run id: 23750d80-b481-4f50-b219-cc9245be405f
    • child session: agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509
    • result: successful completion with a real final child result (SUCCESS-PROBE-OK)
  • A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
    • killed/failed run id: 4965775c-4764-41e9-a77a-692f1ab4c2fd
  • Live failure-path verification on a valid working model/runtime is now complete on gpt-5.4.
    • spawned child run: b50cb91f-6219-44f7-9d2f-a1264ac7ceaf
    • requester session: agent:main:subagent-reliability-failure-hex-1773425126098
    • child session: agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594
    • child transcript: ~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl
    • terminal child assistant message (transcript line 6) recorded:
      • provider: "openai-codex"
      • model: "gpt-5.4"
      • stopReason: "error"
      • errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"
    • matching ~/.openclaw/subagents/runs.json record now correctly persisted:
      • outcome.status: "error"
      • outcome.error: "Codex error: {...context_length_exceeded...}"
      • endedReason: "subagent-error"
      • frozenResultText: "Codex error: {...context_length_exceeded...}"
  • Important nuance from the same live repro: raw gateway agent.wait still returned {"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881} for that failed child. So the current fix is verified for persisted/announced subagent outcomes, but not for the lower-level agent.wait RPC semantics.
  • Follow-up code inspection on 2026-03-13 found that the agent.wait mismatch is a real upstream bug, not intentional layering:
    • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts already treats terminal assistant stopReason:"error" as lifecycle phase:"error".
    • src/gateway/server-methods/agent-wait-dedupe.ts now also interprets resolved agent RPC payloads with result.meta.stopReason:"error" as terminal status:"error" (and aborted:true as timeout).
    • but src/commands/agent.ts still had a fallback path that unconditionally emitted lifecycle phase:"end" whenever no inner lifecycle callback was observed, even if the resolved run result carried meta.stopReason:"error".
    • because waitForAgentJob gives lifecycle errors a retry grace window, that fallback end could overwrite the earlier failed state and make raw agent.wait resolve status:"ok" for a terminal assistant/provider error.
  • Implemented the smallest focused upstream fix on branch fix/subagent-wait-error-outcome:
    • src/commands/agent.ts now emits lifecycle phase:"error" (with extracted terminal error text) when a resolved run stops with meta.stopReason:"error" and no inner lifecycle callback fired.
    • src/commands/agent.test.ts adds coverage for that fallback path.
    • src/gateway/server-methods/agent-wait-dedupe.ts + agent-wait-dedupe.test.ts cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to error/timeout instead of staying ok.
  • Targeted validation for this follow-up passed:
    • pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: passed (81 tests across 3 files).
  • Follow-up live runtime verification on 2026-03-13 showed the current agent.wait fix did not close the live path yet.
    • patched gateway launched directly from source on loopback with channels skipped:
      • command: OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured
      • log evidence: 2026-03-13T18:52:10.743+00:00 [gateway] agent model: openai-codex/gpt-5.3-codex
    • live repro used a fresh default-model session and an oversized in-memory payload over GatewayClient (not CLI argv):
      • session key: agent:main:subagent:agent-wait-gpt53-live-source-1773427981586
      • run id: gwc-live-agent-wait-gpt53-source-1773427981614
      • payload chars: 880150
      • start result: {"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}
      • wait result: {"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}
      • same session's terminal assistant message still recorded a real provider failure:
        • provider: "openai-codex"
        • model: "gpt-5.3-codex"
        • stopReason: "error"
        • errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"
    • earlier temporary gateway runs reinforced the same mismatch:
      • stale dist gateway repro run gwc-live-agent-wait-gpt53-1773427893583 also returned status:"ok" while transcript stopReason remained error
      • temp gpt-5.4 session repro on the same temp gateway returned status:"error", but only because that runtime reported FailoverError: Unknown model: openai-codex/gpt-5.4; that is useful as transport sanity, but not the canonical live semantics proof
  • The final focused live-fix pass on 2026-03-13 closed the remaining agent.wait bug.
    • root cause confirmed: the live direct gateway path could receive an inner agent_end event carrying a terminal assistant error without a preceding message_end, which left stale/empty assistant state and still emitted lifecycle phase:"end"
    • upstream fix extends the embedded subscribe lifecycle handler to recover the terminal assistant from agent_end.messages or the session transcript when state is stale, then emit lifecycle phase:"error" with a friendly error string instead of end
    • upstream fix also updates the direct gateway agent RPC handler to observe lifecycle events for the run and derive the final RPC payload/terminal status from observed lifecycle + resolved result metadata, instead of blindly caching status:"ok" when the outer RPC resolves
    • files changed for the final fix:
      • src/agents/pi-embedded-subscribe.e2e-harness.ts
      • src/agents/pi-embedded-subscribe.handlers.lifecycle.ts
      • src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts
      • src/agents/pi-embedded-subscribe.handlers.ts
      • src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts
      • src/gateway/server-methods/agent.ts
      • src/gateway/server-methods/server-methods.test.ts
  • Final targeted validation passed:
    • pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts
    • result: 108 tests passed across 5 files
  • Final decisive live source-gateway repro after the fix:
    • gateway launch: OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured
    • run id: gwc-live-agent-wait-gpt53-source-fixed2-1773429512008
    • session key: agent:main:subagent:agent-wait-gpt53-live-source-fixed2-1773429512008
    • final agent response with expectFinal: true returned:
      • finalStatus: "error"
      • finalSummary: "LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."
    • matching agent.wait returned:
      • {"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}
  • Net status now:
    • subagent persistence/announcement fix: live-verified
    • raw agent.wait semantics fix: live-verified
  • Side assessment on unrelated dirty upstream work: the /subagents log UX diff in src/auto-reply/reply/commands-subagents/action-log.ts + shared.ts is logically coherent and passed pnpm test -- --run src/auto-reply/reply/commands.test.ts (44 tests), but it is still out-of-scope for this focused reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
  • ACP follow-up pass on 2026-03-13 found a new live-reproducible runtime bug in the bundled extensions/acpx layer:
    • current host state does not expose a global acpx binary on PATH, but the bundled plugin-local runtime exists and works at ~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx
    • current ~/.openclaw/openclaw.json does not contain an explicit acp block or enabled acpx plugin entry, so this pass used the smallest direct runtime repro path instead of a full sessions_spawn(runtime:"acp") OpenClaw run
    • live direct Codex repro now succeeds:
      • command: bundled acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'
      • result: clean JSON-RPC/session stream ending with agent_message_chunk: "OK", id:2 result:{stopReason:"end_turn"}, process exit=0
    • live direct Claude repro does not crash, but returns top-level JSON-RPC auth errors and still exits 0:
      • command: bundled acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'
      • stdout included:
        • {"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}
        • {"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}
      • process exit=0
    • source inspection showed extensions/acpx/src/runtime-internals/events.ts ignored that top-level JSON-RPC error shape during prompt streaming, so runtime.runTurn() could silently treat Claude auth failure as success (done) when no typed error event or non-zero exit was emitted
  • Implemented the smallest focused upstream runtime fix on branch fix/subagent-wait-error-outcome:
    • extensions/acpx/src/runtime-internals/events.ts
      • toAcpxErrorEvent() now recognizes top-level JSON-RPC error responses via parseControlJsonError()
      • parsePromptEventLine() now maps those JSON-RPC errors into ACP runtime type:"error" events instead of dropping them
    • regression coverage added:
      • extensions/acpx/src/runtime-internals/events.test.ts: top-level JSON-RPC prompt error parsing
      • extensions/acpx/src/runtime-internals/test-fixtures.ts: mock prompt path for clean-exit JSON-RPC auth error
      • extensions/acpx/src/runtime.test.ts: runTurn() emits error and does not emit done for the Claude-style auth failure shape
  • Targeted validation for the ACP follow-up fix passed:
    • cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts
    • result: 3 files passed, 22 tests passed
  • Current interpretation of the old Claude/Codex ACP bug after this pass:
    • historical notes still say Claude: acpx exited with code 1, Codex: acpx exited with code 5
    • those exact exit-code crashes were not reproduced today
    • current live state is narrower and better understood:
      • Codex ACP path works directly
      • Claude ACP path currently fails for auth, and OpenClaw previously mishandled that failure shape in the acpx runtime layer
  • Remaining open ACP follow-up after this fix:
    • validate the patched runtime through the real OpenClaw ACP path (sessions_spawn(runtime:"acp")) once ACP is explicitly enabled/configured here, or whenever a fresh end-to-end repro is available
    • only reopen the historical acpx exited with code 1/5 line if a fresh repro appears
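
The agent.wait semantics verified above (terminal status derived from observed lifecycle events plus resolved result metadata, rather than a blindly cached status "ok") can be summarized in a small sketch. It is a simplification under stated assumptions: resolveWaitStatus and its types are hypothetical, the location of the aborted flag under meta is assumed, and the retry grace window in waitForAgentJob is not modeled.

```typescript
// Hypothetical types for illustration; not the upstream gateway types.
type LifecyclePhase = "start" | "end" | "error";
type WaitStatus = "ok" | "error" | "timeout";
type ResolvedRun = {
  // Assumption: both flags live under meta; this note only names
  // result.meta.stopReason explicitly.
  meta?: { stopReason?: string; aborted?: boolean };
};

// Hypothetical helper: derive the terminal wait status from the resolved
// run metadata first, then from the last terminal lifecycle phase observed.
function resolveWaitStatus(observed: LifecyclePhase[], result: ResolvedRun): WaitStatus {
  if (result.meta?.aborted === true) return "timeout";
  if (result.meta?.stopReason === "error") return "error";

  // Find the last terminal phase ("end" or "error") that was observed.
  let lastTerminal: LifecyclePhase | undefined;
  for (const phase of observed) {
    if (phase === "end" || phase === "error") lastTerminal = phase;
  }
  if (lastTerminal === "error") return "error";

  return "ok";
}
```

The point of the ordering is that a fallback "end" lifecycle event cannot mask a resolved result whose metadata already reports a terminal error, which is the mismatch the earlier live repros showed.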

Constraints

  • Prefer evidence over theory.
  • Do not claim a fix without concrete validation.
  • Keep the main session clean; use this file as the canonical baton.

Success criteria

  • Clear diagnosis of the current reliability problem(s).
  • At least one of:
    • implemented fix with validation, or
    • sharply scoped next fix plan with exact evidence and files.
  • memory/2026-03-13.md (or current daily note), memory/tasks.json, and this WIP updated.