docs(reliability): record agent wait fix diagnosis
15
HANDOFF.md
@@ -4,7 +4,7 @@
Immediate baton-pass for the next fresh implementation session.
## Current objective
-Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is now captured; the remaining question is whether lower-level `agent.wait` semantics also need fixing or whether the issue is sufficiently solved at the subagent registry / completion layer.
+Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The failure-path proof for the new subagent outcome handling is captured, and a focused upstream `agent.wait` semantics fix is now implemented/tested on branch `fix/subagent-wait-error-outcome`; the remaining follow-up is deployment/live verification, not root-cause discovery.
## Use these state files first
1. `WIP.subagent-reliability.md` — canonical state for this pass
@@ -31,11 +31,14 @@ Investigate and improve subagent / ACP delegation reliability with evidence-firs
   - run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
   - child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
   - `runs.json` now stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
-2. Decide whether raw gateway `agent.wait` should also report `status: "error"` for terminal assistant errors. Current live evidence for the same failed child:
-   - `agent.wait` returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}`
-3. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent outcome bug.
-4. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
-5. Update WIP + memory + tasks before ending.
+2. For raw gateway `agent.wait`, use the new upstream diagnosis/fix rather than re-arguing semantics:
+   - decision: the previous `status:"ok"` was a bug, not intended layering
+   - cause: `src/commands/agent.ts` fallback lifecycle emission used `phase:"end"` even when resolved run `meta.stopReason:"error"`
+   - fix: `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text in that case; `src/gateway/server-methods/agent-wait-dedupe.ts` also maps resolved agent payloads with terminal `stopReason:"error"` / `aborted:true` to `error` / `timeout`
+   - targeted validation passed: `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
+3. If continuing, do a low-noise live verification on the patched gateway/runtime for the same failure class, then report whether raw `agent.wait` now returns `status:"error"` as expected.
+4. Re-check whether ACP-specific Claude/Codex runtime failures are still reproducible after separating them from the generic subagent outcome bug.
+5. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
## Success criteria
- Real-run verification of the new error/outcome fix. ✅ done for subagent persistence/announcement handling.
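
For orientation, the mapping described in the `agent.wait` fix notes can be sketched roughly as below. This is an illustrative reconstruction, not the code in `src/commands/agent.ts` or `src/gateway/server-methods/agent-wait-dedupe.ts`: the names `ResolvedRunMeta`, `lifecyclePhaseFor`, and `waitStatusFor` are hypothetical, and checking `stopReason` before `aborted` is an assumed precedence that the notes do not state.

```typescript
// Hypothetical shapes; the real upstream types are not shown in this handoff.
type WaitStatus = "ok" | "error" | "timeout";
type LifecyclePhase = "end" | "error";

interface ResolvedRunMeta {
  stopReason?: string; // the notes mention the terminal value "error"
  aborted?: boolean;   // the notes mention aborted:true mapping to "timeout"
}

// Fallback lifecycle emission: report "error" instead of "end" when the
// resolved run stopped on a terminal error.
function lifecyclePhaseFor(meta: ResolvedRunMeta | undefined): LifecyclePhase {
  return meta?.stopReason === "error" ? "error" : "end";
}

// Wait-status mapping for resolved agent payloads.
// Assumption: an error stop reason takes precedence over an abort flag.
function waitStatusFor(meta: ResolvedRunMeta | undefined): WaitStatus {
  if (meta?.stopReason === "error") return "error";
  if (meta?.aborted === true) return "timeout";
  return "ok";
}
```

Under this sketch, the failed child from the live evidence (a run resolved with `stopReason: "error"`) would produce `status: "error"` from `agent.wait` instead of the previously observed `status: "ok"`, which is the behavior the live verification step is meant to confirm.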