# 2026-03-13
## Subagent reliability investigation
- A fresh implementation subagent launched for the subagent/ACP reliability work failed immediately, before doing any task work.
- Failure mode: delegated run was spawned with model `glm-5`, which resolved to provider model `zai/glm-5`.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for `glm-5`.
- Verified by inspecting agent auth profile keys under:
  - `/home/openclaw/.openclaw/agents/*/agent/auth-profiles.json`
- Relevant OpenClaw docs confirm:
  - subagent spawns inherit caller model when `sessions_spawn.model` is omitted
  - provider/model auth errors like `No API key found for provider "zai"` occur when a provider model is selected without matching auth
  - multi-agent auth is per-agent via `~/.openclaw/agents/<agentId>/agent/auth-profiles.json`
- Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
- Corrective action: retry fresh delegation with `litellm/glm-5` (the intended medium-tier routed model for delegated implementation work in this setup).
- Will explicitly requested on 2026-03-13 to use `gpt-5.4` for subagents for now while debugging delegation reliability.
- New evidence from the corrected run: `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` shows repeated assistant `stopReason:"error"` entries with `429 ... GLM-5 not included in current subscription plan`, but `~/.openclaw/subagents/runs.json` recorded run `776a8b51-6fdc-448e-83bc-55418814a05b` as `outcome.status: "ok"` and `frozenResultText: null`.
- That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result.
- Implemented upstream fix on branch `external/openclaw-upstream@fix/subagent-wait-error-outcome`:
  - added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
  - subagent registry now downgrades `agent.wait => ok` to `error` when the child session's terminal assistant message is actually an error
  - subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful `(no output)` completions
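
The downgrade rule above can be sketched roughly as follows. This is a minimal illustration, not the upstream implementation: the type and function names (`TranscriptMessage`, `terminalText`, `resolveOutcome`) are hypothetical, and only `stopReason`, `errorMessage`, and the `ok`/`error` statuses mirror fields quoted in these notes.

```typescript
// Hypothetical shapes for illustration; only stopReason/errorMessage mirror
// fields quoted from the transcript in these notes.
type TranscriptMessage = {
  role: "user" | "assistant" | "tool";
  text?: string;
  stopReason?: string;
  errorMessage?: string;
};

type WaitStatus = "ok" | "error" | "timeout";

type RunOutcome = { status: WaitStatus; error?: string; resultText: string | null };

// Find the last assistant message in a transcript, if any.
function lastAssistant(messages: TranscriptMessage[]): TranscriptMessage | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m && m.role === "assistant") return m;
  }
  return undefined;
}

// Produce usable terminal text even when an assistant error has empty content.
function terminalText(msg: TranscriptMessage): string | null {
  if (msg.stopReason === "error") {
    return msg.errorMessage || msg.text || "assistant error (no details available)";
  }
  return msg.text ? msg.text : null;
}

// Downgrade a reported "ok" to "error" when the transcript ends in an assistant error.
function resolveOutcome(waitStatus: WaitStatus, messages: TranscriptMessage[]): RunOutcome {
  const last = lastAssistant(messages);
  const resultText = last ? terminalText(last) : null;
  if (waitStatus === "ok" && last?.stopReason === "error") {
    return { status: "error", error: resultText ?? "assistant error", resultText };
  }
  return { status: waitStatus, resultText };
}
```

The point of the sketch is that the wait status alone is never trusted: the terminal assistant message gets the final say when it records an error.
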
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `50 tests` passed across `3` files
- Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`.
- Real failure-path verification later also passed on valid `gpt-5.4` by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload.
  - child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"`
  - matching `~/.openclaw/subagents/runs.json` now correctly stored:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
  - embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
  - but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
  - `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
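
The dedupe-payload mapping described above might look something like this. It is a sketch under assumptions: the `DedupePayload` and `WaitSnapshot` shapes and the function name are hypothetical, and checking `aborted` before `stopReason` is an arbitrary ordering that these notes do not specify.

```typescript
// Hypothetical payload shape; only aborted and result.meta.stopReason
// come from the notes.
type DedupePayload = {
  runId: string;
  aborted?: boolean;
  result?: { meta?: { stopReason?: string } };
};

type WaitSnapshot = { runId: string; status: "ok" | "error" | "timeout" };

// Map a completed dedupe payload to the status that agent.wait should report.
function snapshotFromPayload(payload: DedupePayload): WaitSnapshot {
  // Assumption: an aborted run wins over any stop reason.
  if (payload.aborted === true) {
    return { runId: payload.runId, status: "timeout" };
  }
  const meta = payload.result ? payload.result.meta : undefined;
  if (meta && meta.stopReason === "error") {
    return { runId: payload.runId, status: "error" };
  }
  return { runId: payload.runId, status: "ok" };
}
```
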
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `81 tests` passed across `3` files
- Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does **not** hold on the live direct gateway path.
  - first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as transport sanity, not canonical semantics proof
  - stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch:
    - session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572`
    - run id: `gwc-live-agent-wait-gpt53-1773427893583`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}`
    - last assistant still recorded `stopReason:"error"` with `context_length_exceeded`
  - decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with:
    - `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - gateway log confirmed default model `openai-codex/gpt-5.3-codex`
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
    - same session's terminal assistant message still recorded:
      - `provider:"openai-codex"`
      - `model:"gpt-5.3-codex"`
      - `stopReason:"error"`
      - `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
- Fast source inspection after that live repro points to the most likely remaining gap:
  - `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded`
  - `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
  - `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"`
  - so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as `end` early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path matters
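
A toy model of the suspected gap above, to make the guard interaction concrete. This is a simplified sketch, not the code in `src/commands/agent.ts` or `agent-job.ts`: `emitLifecycle` and `statusForPhase` are hypothetical names, and the retry grace window mentioned earlier is deliberately left out.

```typescript
type LifecyclePhase = "start" | "end" | "error";
type LifecycleEvent = { phase: LifecyclePhase; error?: string };

// Replays inner lifecycle callbacks, then applies the corrective fallback
// only when no inner callback already reported a terminal phase.
function emitLifecycle(
  innerEvents: LifecycleEvent[],
  resultStopReason: string | undefined,
  errorText: string,
): LifecycleEvent[] {
  const emitted: LifecycleEvent[] = [];
  let lifecycleEnded = false;
  for (const ev of innerEvents) {
    emitted.push(ev);
    if (ev.phase === "end" || ev.phase === "error") lifecycleEnded = true;
  }
  if (!lifecycleEnded) {
    emitted.push(
      resultStopReason === "error" ? { phase: "error", error: errorText } : { phase: "end" },
    );
  }
  return emitted;
}

// Mirrors the note that a lifecycle "end" is treated as terminal "ok".
function statusForPhase(phase: LifecyclePhase): "ok" | "error" | "pending" {
  if (phase === "end") return "ok";
  if (phase === "error") return "error";
  return "pending";
}
```

With inner events `start` then `end` and a result whose stop reason is `error`, the guard suppresses the corrective event and the last phase maps to `ok`, which is the mismatch seen in the live repro. With only `start` observed, the fallback emits `error` as intended.
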
- Net status at end of this pass:
  - subagent persistence/announcement fix: live-verified
  - raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
- Final focused live-fix pass on 2026-03-13 closed the remaining raw `agent.wait` bug.
  - root cause: the live direct gateway path could receive `agent_end` carrying a terminal assistant error without a preceding `message_end`, leaving stale/empty assistant state and still emitting lifecycle `phase:"end"`
  - final upstream fix taught embedded subscribe lifecycle handling to recover the terminal assistant from `agent_end.messages` / session transcript and emit lifecycle `phase:"error"`, and taught the gateway `agent` RPC handler to derive terminal status from observed lifecycle + final result metadata instead of blindly caching `ok`
  - final targeted validation passed:
    - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
    - result: `108 tests` passed across `5` files
  - decisive live source-gateway repro after the final fix:
    - gateway: source-run on port `18903`
    - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
    - final `agent` response returned `finalStatus:"error"`
    - matching `agent.wait` returned `status:"error"` with the same context-window error text
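
The final fix above has two halves, which can be sketched as below. The shapes and names here (`AgentEndEvent`, `lifecycleForAgentEnd`, `deriveFinalStatus`) are hypothetical stand-ins rather than the upstream code, and preferring `agent_end.messages` over previously tracked assistant state is an assumption based on the stated root cause (tracked state can be stale when `message_end` never arrived).

```typescript
type AgentMessage = { role: string; stopReason?: string; errorMessage?: string };
type AgentEndEvent = { type: "agent_end"; messages?: AgentMessage[] };
type Lifecycle = { phase: "end" } | { phase: "error"; error: string };

// Half 1: recover the terminal assistant from agent_end.messages, falling
// back to tracked state, and pick the lifecycle phase from it.
function lifecycleForAgentEnd(ev: AgentEndEvent, trackedAssistant?: AgentMessage): Lifecycle {
  const fromEvent = [...(ev.messages ?? [])].reverse().find((m) => m.role === "assistant");
  const terminal = fromEvent ?? trackedAssistant;
  if (terminal && terminal.stopReason === "error") {
    return { phase: "error", error: terminal.errorMessage ?? "assistant error" };
  }
  return { phase: "end" };
}

// Half 2: derive the final status from both the observed lifecycle and the
// final result metadata instead of assuming "ok".
function deriveFinalStatus(
  observed: Lifecycle | undefined,
  resultStopReason: string | undefined,
): "ok" | "error" {
  if (observed && observed.phase === "error") return "error";
  if (resultStopReason === "error") return "error";
  return "ok";
}
```

Either signal alone is enough to report `error` in this sketch, so a missed lifecycle event no longer hides a failed result.
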
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.
## ACP Claude/Codex follow-up (post-`agent.wait` fix)
- Historical deferred task `task-20260304-211216-acp-claude-codex` still referenced old failures `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were **not** reproduced in the latest focused pass.
- Current host state check:
  - `claude` installed: `/home/linuxbrew/.linuxbrew/bin/claude` (`2.1.63`)
  - `codex` installed: `/home/linuxbrew/.linuxbrew/bin/codex` (`0.107.0`)
  - no global `acpx` on PATH, but bundled plugin-local runtime exists at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` only showed `plugins.entries.telegram.enabled=true`; no explicit `acp` block / `acpx` plugin entry was present, so the smallest reliable repro path used the bundled `acpx` directly rather than a full OpenClaw ACP session
- Live direct bundled-acpx repro results:
  - Codex command:
    - `.../acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ended with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - Claude command:
    - `.../acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included top-level JSON-RPC errors:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process still exited `0`
- Source-level finding in `external/openclaw-upstream/extensions/acpx/src/runtime-internals/events.ts`:
  - prompt parsing handled typed `{type:"error"}` lines but dropped top-level JSON-RPC `error` responses
  - that meant `runtime.runTurn()` could treat a Claude auth failure as success (`done`) when the agent emitted JSON-RPC errors yet exited cleanly
- Implemented focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now also recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now emits ACP runtime `type:"error"` events for that shape instead of dropping it
  - added regression coverage:
    - `extensions/acpx/src/runtime-internals/events.test.ts`
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`
    - `extensions/acpx/src/runtime.test.ts`
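
The parsing gap and its fix can be illustrated with a small standalone sketch. This does not reproduce `toAcpxErrorEvent()` or `parseControlJsonError()`; the names `parseErrorLine` and `turnOutcome` are hypothetical, and the assumption that a typed `{type:"error"}` line carries a string `message` field is mine, not something these notes confirm.

```typescript
type AcpErrorEvent = { type: "error"; code?: number; message: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns an error event for either a typed {type:"error"} line or a
// top-level JSON-RPC error response; returns null for anything else.
function parseErrorLine(line: string): AcpErrorEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null; // not JSON, so not an error event
  }
  if (!isRecord(parsed)) return null;

  const typedMessage = parsed["message"];
  if (parsed["type"] === "error" && typeof typedMessage === "string") {
    return { type: "error", message: typedMessage };
  }

  const rpcError = parsed["error"];
  if (parsed["jsonrpc"] === "2.0" && isRecord(rpcError)) {
    const message = rpcError["message"];
    if (typeof message !== "string") return null;
    const event: AcpErrorEvent = { type: "error", message };
    const code = rpcError["code"];
    if (typeof code === "number") event.code = code;
    return event;
  }
  return null;
}

// A turn fails if any stdout line is an error event, even when the process exits 0.
function turnOutcome(stdoutLines: string[], exitCode: number): "done" | "failed" {
  const sawError = stdoutLines.some((line) => parseErrorLine(line) !== null);
  return sawError || exitCode !== 0 ? "failed" : "done";
}
```

Fed the two `Authentication required` lines from the Claude repro with exit code `0`, `turnOutcome` reports `failed`, while the Codex-style `result` line with exit code `0` reports `done`.
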
- Targeted validation passed:
  - `cd /home/openclaw/.openclaw/workspace/external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `22` tests passed across `3` files
- Net status after this pass:
  - old `acpx exited with code 1/5` reports remain historical evidence only
  - Codex ACP direct runtime path works today
  - Claude ACP direct runtime path currently fails for auth, and OpenClaw had a real bug in how the bundled acpx runtime parsed that failure shape
  - remaining follow-up is end-to-end OpenClaw ACP-path validation once ACP is explicitly configured here (or if a fresh exit-code repro appears)
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
- Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.