docs(reliability): record agent wait fix diagnosis
This commit is contained in:
15
HANDOFF.md
15
HANDOFF.md
@@ -4,7 +4,7 @@
|
|||||||
Immediate baton-pass for the next fresh implementation session.
|
Immediate baton-pass for the next fresh implementation session.
|
||||||
|
|
||||||
## Current objective
|
## Current objective
|
||||||
Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is now captured; the remaining question is whether lower-level `agent.wait` semantics also need fixing or whether the issue is sufficiently solved at the subagent registry / completion layer.
|
Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is captured, and a focused upstream `agent.wait` semantics fix is now implemented/tested on branch `fix/subagent-wait-error-outcome`; the remaining follow-up is deployment/live verification, not root-cause discovery.
|
||||||
|
|
||||||
## Use these state files first
|
## Use these state files first
|
||||||
1. `WIP.subagent-reliability.md` — canonical state for this pass
|
1. `WIP.subagent-reliability.md` — canonical state for this pass
|
||||||
@@ -31,11 +31,14 @@ Investigate and improve subagent / ACP delegation reliability with evidence-firs
|
|||||||
- run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
|
- run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
|
||||||
- child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
|
- child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
|
||||||
- `runs.json` now stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
|
- `runs.json` now stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
|
||||||
2. Decide whether raw gateway `agent.wait` should also report `status: "error"` for terminal assistant errors. Current live evidence for the same failed child:
|
2. For raw gateway `agent.wait`, use the new upstream diagnosis/fix rather than re-arguing semantics:
|
||||||
- `agent.wait` returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}`
|
- decision: the previous `status:"ok"` was a bug, not intended layering
|
||||||
3. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent outcome bug.
|
- cause: `src/commands/agent.ts` fallback lifecycle emission used `phase:"end"` even when resolved run `meta.stopReason:"error"`
|
||||||
4. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
|
- fix: `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text in that case; `src/gateway/server-methods/agent-wait-dedupe.ts` also maps resolved agent payloads with terminal `stopReason:"error"` / `aborted:true` to `error` / `timeout`
|
||||||
5. Update WIP + memory + tasks before ending.
|
- targeted validation passed: `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
|
||||||
|
3. If continuing, do a low-noise live verification on the patched gateway/runtime for the same failure class, then report whether raw `agent.wait` now returns `status:"error"` as expected.
|
||||||
|
4. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent outcome bug.
|
||||||
|
5. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
|
||||||
|
|
||||||
## Success criteria
|
## Success criteria
|
||||||
- Real-run verification of the new error/outcome fix. ✅ done for subagent persistence/announcement handling.
|
- Real-run verification of the new error/outcome fix. ✅ done for subagent persistence/announcement handling.
|
||||||
|
|||||||
@@ -77,8 +77,20 @@ This is the highest-leverage remaining open reliability item because it affects
|
|||||||
- `endedReason: "subagent-error"`
|
- `endedReason: "subagent-error"`
|
||||||
- `frozenResultText: "Codex error: {...context_length_exceeded...}"`
|
- `frozenResultText: "Codex error: {...context_length_exceeded...}"`
|
||||||
- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
|
- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
|
||||||
|
- Follow-up code inspection on 2026-03-13 found that the `agent.wait` mismatch is a real upstream bug, not intentional layering:
|
||||||
|
- `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` already treats terminal assistant `stopReason:"error"` as lifecycle `phase:"error"`.
|
||||||
|
- `src/gateway/server-methods/agent-wait-dedupe.ts` now also interprets resolved agent RPC payloads with `result.meta.stopReason:"error"` as terminal `status:"error"` (and `aborted:true` as `timeout`).
|
||||||
|
- but `src/commands/agent.ts` still had a fallback path that unconditionally emitted lifecycle `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`.
|
||||||
|
- because `waitForAgentJob` gives lifecycle errors a retry grace window, that fallback `end` could overwrite the earlier failed state and make raw `agent.wait` resolve `status:"ok"` for a terminal assistant/provider error.
|
||||||
|
- Implemented the smallest focused upstream fix on branch `fix/subagent-wait-error-outcome`:
|
||||||
|
- `src/commands/agent.ts` now emits lifecycle `phase:"error"` (with extracted terminal error text) when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired.
|
||||||
|
- `src/commands/agent.test.ts` adds coverage for that fallback path.
|
||||||
|
- `src/gateway/server-methods/agent-wait-dedupe.ts` + `agent-wait-dedupe.test.ts` cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to `error`/`timeout` instead of staying `ok`.
|
||||||
|
- Targeted validation for this follow-up passed:
|
||||||
|
- `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
|
||||||
|
- result: passed (`81 tests` across `3` files).
|
||||||
|
- Remaining open item: no second live hot-swap/runtime repro was attempted in this pass, so the new `agent.wait` fix is validated by exact code-path inspection plus focused tests, not yet by another live gateway run.
|
||||||
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
|
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
|
||||||
- Next step if continuing core work: decide whether `agent.wait` itself should downgrade terminal assistant errors to `status: "error"`, or whether the current contract is acceptable now that subagent registry persistence/announcements are fixed.
|
|
||||||
|
|
||||||
## Constraints
|
## Constraints
|
||||||
- Prefer evidence over theory.
|
- Prefer evidence over theory.
|
||||||
|
|||||||
@@ -34,7 +34,18 @@
|
|||||||
- `outcome.error: "Codex error: {...context_length_exceeded...}"`
|
- `outcome.error: "Codex error: {...context_length_exceeded...}"`
|
||||||
- `endedReason: "subagent-error"`
|
- `endedReason: "subagent-error"`
|
||||||
- `frozenResultText: "Codex error: {...context_length_exceeded...}"`
|
- `frozenResultText: "Codex error: {...context_length_exceeded...}"`
|
||||||
- Important remaining nuance: raw gateway `agent.wait` for that same failed child still returned `status:"ok"` with only `endedAt`, so the current fix is verified for subagent outcome persistence/announcements but not for lower-level `agent.wait` semantics.
|
- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
|
||||||
|
- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
|
||||||
|
- embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
|
||||||
|
- but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
|
||||||
|
- `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
|
||||||
|
- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
|
||||||
|
- `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
|
||||||
|
- `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
|
||||||
|
- Targeted validation passed:
|
||||||
|
- `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
|
||||||
|
- result: `81 tests` passed across `3` files
|
||||||
|
- Live runtime verification of the new `agent.wait` fix was not re-run in this pass; current evidence is exact code-path inspection plus focused tests.
|
||||||
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.
|
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.
|
||||||
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
|
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
|
||||||
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
|
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
|
||||||
|
|||||||
@@ -28,7 +28,8 @@
|
|||||||
"Upstream patch committed in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547 (suppress internal runtime completion context blocks in TUI formatter).",
|
"Upstream patch committed in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547 (suppress internal runtime completion context blocks in TUI formatter).",
|
||||||
"Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z",
|
"Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z",
|
||||||
"2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files).",
|
"2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files).",
|
||||||
"2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok."
|
"2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok.",
|
||||||
|
"2026-03-13 later: traced raw agent.wait=status:ok-on-terminal-error to an upstream bug in commands/agent.ts fallback lifecycle emission (phase:end emitted even when resolved run meta.stopReason=error). Added focused upstream fix plus dedupe-path handling/tests on branch fix/subagent-wait-error-outcome; targeted validation passed (81 tests across commands/agent.test.ts, gateway/server-methods/agent-wait-dedupe.test.ts, gateway/server-methods/server-methods.test.ts). Live verification of the new agent.wait behavior remains open."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
|
|||||||
Reference in New Issue
Block a user