From f2b99841af8c960dcf90cf74c576a04cbf0f1ee6 Mon Sep 17 00:00:00 2001
From: zap
Date: Fri, 13 Mar 2026 18:56:52 +0000
Subject: [PATCH] docs(reliability): record live agent.wait blocker evidence

---
 HANDOFF.md                  | 21 ++++++++++++++-------
 WIP.subagent-reliability.md | 25 ++++++++++++++++++++++++-
 memory/2026-03-13.md        | 29 ++++++++++++++++++++++++++++-
 memory/tasks.json           |  3 ++-
 4 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/HANDOFF.md b/HANDOFF.md
index b935c47..378b894 100644
--- a/HANDOFF.md
+++ b/HANDOFF.md
@@ -4,7 +4,7 @@ Immediate baton-pass for the next fresh implementation session.
 
 ## Current objective
 
-Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is captured, and a focused upstream `agent.wait` semantics fix is now implemented/tested on branch `fix/subagent-wait-error-outcome`; the remaining follow-up is deployment/live verification, not root-cause discovery.
+Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is captured, but the focused upstream `agent.wait` semantics fix on branch `fix/subagent-wait-error-outcome` did **not** hold in a fresh live source-gateway repro, so the remaining work is a narrower root-cause follow-up on the still-open live `agent.wait => ok` path.
 
 ## Use these state files first
 1. `WIP.subagent-reliability.md` — canonical state for this pass
@@ -31,12 +31,19 @@ Investigate and improve subagent / ACP delegation reliability with evidence-firs
    - run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
    - child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
    - `runs.json` now stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
-2. For raw gateway `agent.wait`, use the new upstream diagnosis/fix rather than re-arguing semantics:
-   - decision: the previous `status:"ok"` was a bug, not intended layering
-   - cause: `src/commands/agent.ts` fallback lifecycle emission used `phase:"end"` even when resolved run `meta.stopReason:"error"`
-   - fix: `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text in that case; `src/gateway/server-methods/agent-wait-dedupe.ts` also maps resolved agent payloads with terminal `stopReason:"error"` / `aborted:true` to `error` / `timeout`
-   - targeted validation passed: `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
-3. If continuing, do a low-noise live verification on the patched gateway/runtime for the same failure class, then report whether raw `agent.wait` now returns `status:"error"` as expected.
+2. Treat raw gateway `agent.wait` as still **open** despite the current follow-up fix branch.
+   - decisive live source-gateway repro:
+     - gateway launch: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
+     - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
+     - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
+     - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
+     - last assistant: `provider:"openai-codex" model:"gpt-5.3-codex" stopReason:"error" errorMessage contains context_length_exceeded`
+   - this is the current canonical blocker evidence for the still-open live path
+3. Most likely remaining gap to investigate next:
+   - `src/commands/agent.ts` only applies the new fallback correction when `!lifecycleEnded`
+   - `lifecycleEnded` flips true on any inner lifecycle `phase:"end"` or `phase:"error"`
+   - `src/gateway/server-methods/agent-job.ts` resolves/caches `phase:"end"` as terminal `status:"ok"`
+   - so an inner lifecycle emitter is still the likeliest place where terminal assistant/provider failures are being marked `end` too early on the live direct gateway path
 4. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent outcome bug.
 5. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
 
diff --git a/WIP.subagent-reliability.md b/WIP.subagent-reliability.md
index 10598f3..57f5e06 100644
--- a/WIP.subagent-reliability.md
+++ b/WIP.subagent-reliability.md
@@ -89,7 +89,30 @@ This is the highest-leverage remaining open reliability item because it affects
 - Targeted validation for this follow-up passed:
   - `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
   - result: passed (`81 tests` across `3` files).
-- Remaining open item: no second live hot-swap/runtime repro was attempted in this pass, so the new `agent.wait` fix is validated by exact code-path inspection plus focused tests, not yet by another live gateway run.
+- Follow-up live runtime verification on 2026-03-13 showed the current `agent.wait` fix did **not** close the live path yet.
+  - patched gateway launched directly from source on loopback with channels skipped:
+    - command: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
+    - log evidence: `2026-03-13T18:52:10.743+00:00 [gateway] agent model: openai-codex/gpt-5.3-codex`
+  - live repro used a fresh default-model session and an oversized in-memory payload over `GatewayClient` (not CLI argv):
+    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
+    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
+    - payload chars: `880150`
+    - start result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
+    - wait result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
+  - same session's terminal assistant message still recorded a real provider failure:
+    - `provider: "openai-codex"`
+    - `model: "gpt-5.3-codex"`
+    - `stopReason: "error"`
+    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
+  - earlier temporary gateway runs reinforced the same mismatch:
+    - stale dist gateway repro run `gwc-live-agent-wait-gpt53-1773427893583` also returned `status:"ok"` while transcript stopReason remained `error`
+    - temp `gpt-5.4` session repro on the same temp gateway returned `status:"error"`, but only because that runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; that is useful as transport sanity, but **not** the canonical live semantics proof
+- Most likely remaining code-path gap (high-confidence from source inspection):
+  - `src/commands/agent.ts` only applies the new fallback correction when `!lifecycleEnded`
+  - `lifecycleEnded` is set as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
+  - `src/gateway/server-methods/agent-job.ts` immediately caches/resolves `phase:"end"` as terminal `status:"ok"`
+  - so if an inner embedded lifecycle emitter still reports `phase:"end"` for a run whose final assistant message later has `stopReason:"error"`, `agent.wait` will still resolve `ok` before the dedupe/result-meta rescue path matters
+  - likely next target: identify the inner lifecycle emitter that is still producing `phase:"end"` on this direct gateway path and either convert that event to `phase:"error"` for terminal assistant failures or make `agent.wait` prefer final dedupe/result-meta over earlier lifecycle `end` when both exist for the same run
 - Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
 
 ## Constraints
diff --git a/memory/2026-03-13.md b/memory/2026-03-13.md
index c88929a..25c2892 100644
--- a/memory/2026-03-13.md
+++ b/memory/2026-03-13.md
@@ -45,7 +45,34 @@
 - Targeted validation passed:
   - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
   - result: `81 tests` passed across `3` files
-- Live runtime verification of the new `agent.wait` fix was not re-run in this pass; current evidence is exact code-path inspection plus focused tests.
+- Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does **not** hold on the live direct gateway path.
+  - first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as transport sanity, not canonical semantics proof
+  - stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch:
+    - session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572`
+    - run id: `gwc-live-agent-wait-gpt53-1773427893583`
+    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}`
+    - last assistant still recorded `stopReason:"error"` with `context_length_exceeded`
+  - decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with:
+    - `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
+    - gateway log confirmed default model `openai-codex/gpt-5.3-codex`
+    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
+    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
+    - payload chars: `880150`
+    - start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
+    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
+  - same session's terminal assistant message still recorded:
+    - `provider:"openai-codex"`
+    - `model:"gpt-5.3-codex"`
+    - `stopReason:"error"`
+    - `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
+- Fast source inspection after that live repro points to the most likely remaining gap:
+  - `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded`
+  - `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
+  - `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"`
+  - so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as `end` early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path matters
+- Net status at end of this pass:
+  - subagent persistence/announcement fix: live-verified
+  - raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
 - Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.
 - Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
 - Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
diff --git a/memory/tasks.json b/memory/tasks.json
index 9b82698..0823ba6 100644
--- a/memory/tasks.json
+++ b/memory/tasks.json
@@ -29,7 +29,8 @@
       "Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z",
       "2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files).",
       "2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok.",
-      "2026-03-13 later: traced raw agent.wait=status:ok-on-terminal-error to an upstream bug in commands/agent.ts fallback lifecycle emission (phase:end emitted even when resolved run meta.stopReason=error). Added focused upstream fix plus dedupe-path handling/tests on branch fix/subagent-wait-error-outcome; targeted validation passed (81 tests across commands/agent.test.ts, gateway/server-methods/agent-wait-dedupe.test.ts, gateway/server-methods/server-methods.test.ts). Live verification of the new agent.wait behavior remains open."
+      "2026-03-13 later: traced raw agent.wait=status:ok-on-terminal-error to an upstream bug in commands/agent.ts fallback lifecycle emission (phase:end emitted even when resolved run meta.stopReason=error). Added focused upstream fix plus dedupe-path handling/tests on branch fix/subagent-wait-error-outcome; targeted validation passed (81 tests across commands/agent.test.ts, gateway/server-methods/agent-wait-dedupe.test.ts, gateway/server-methods/server-methods.test.ts). Live verification of the new agent.wait behavior remains open.",
+      "2026-03-13 final live pass: a fresh source-run gateway on port 18902 still returned agent.wait status=ok for run gwc-live-agent-wait-gpt53-source-1773427981614 even though the same session's terminal assistant message had provider=openai-codex model=gpt-5.3-codex stopReason=error with context_length_exceeded. Most likely remaining gap: an inner lifecycle emitter still marks the live direct gateway path as phase:end early enough that waitForAgentJob resolves ok before dedupe/result-meta rescue logic can win."
     ]
   },
   {