Files
swarm-zap/HANDOFF.md

66 lines
5.5 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# HANDOFF.md
## Purpose
Immediate baton-pass for the next fresh implementation session.
## Current objective
Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The subagent persistence/announcement fix and the raw `agent.wait` semantics fix are now both live-verified on branch `fix/subagent-wait-error-outcome`; the next work should stay tightly scoped to ACP-specific Claude/Codex follow-up. This pass already narrowed that thread to a real bundled-acpx parser bug for Claude-style JSON-RPC auth failures and landed a focused fix/tests. The remaining work is end-to-end OpenClaw ACP-path validation (or a fresh repro of the older exit-code crash notes) plus normal commit/push/PR cleanup when desired.
## Use these state files first
1. `WIP.subagent-reliability.md` — canonical state for this pass
2. `memory/tasks.json` — task tracking for reliability items
3. `memory/2026-03-04-subagent-delegation.md` — earlier delegation context
4. `memory/2026-03-13.md` if present, otherwise append todays evidence there
5. `external/openclaw-upstream/` — for any core-runtime fix work
## Related tasks
- `task-20260304-2215-subagent-reliability` — in progress
- `task-20260304-211216-acp-claude-codex` — open
## Known truths
- TUI noise suppression was already patched locally and upstreamed earlier.
- User still wants actual subagent reliability improved, not just UI noise hidden.
- Historical ACP notes included `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were **not** reproduced in the latest pass.
- Fresh-session implementation discipline is now the expected approach for non-trivial work.
- One explicit failure mode is already understood: requesting `glm-5` can route into an unavailable GLM-5 provider/entitlement path in this setup.
- A deeper bug was also identified and fixed earlier: a subagent run could finish with terminal assistant errors yet still be recorded as successful with no frozen result.
- Current host state for ACP follow-up:
- bundled plugin-local `acpx` exists and runs
- `~/.openclaw/openclaw.json` currently has no explicit `acp` block / enabled `acpx` plugin entry, so this pass used the smallest direct acpx repro path instead of a full OpenClaw ACP session
- New confirmed acpx/runtime bug from this pass:
- Codex direct acpx path works
- Claude direct acpx path returns top-level JSON-RPC auth errors (`Authentication required`) and exits `0`
- `extensions/acpx/src/runtime-internals/events.ts` previously dropped that JSON-RPC error shape during prompt streaming, so OpenClaw could falsely treat the turn as successful
- A focused upstream fix for that runtime bug now exists on `fix/subagent-wait-error-outcome` with targeted tests passing.
## Highest-priority next actions
1. Treat the generic reliability fixes as live-verified on this branch:
- subagent persistence/announcement proof:
- run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
- child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
- `runs.json` stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
- raw `agent.wait` live-fix proof:
- gateway launch: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured`
- run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
- final `agent` response: `finalStatus:"error"`
- `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}`
2. Treat the ACP follow-up as partially closed, not fully done:
- live direct bundled-acpx Codex repro now works and returns `OK`
- live direct bundled-acpx Claude repro returns JSON-RPC auth errors with process exit `0`
- focused upstream fix now maps top-level JSON-RPC prompt errors into ACP runtime `type:"error"` events instead of silently dropping them
- targeted validation passed:
- `cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
- result: `22` tests passed across `3` files
3. Next, do end-to-end OpenClaw ACP validation if/when ACP is explicitly enabled here:
- confirm or add the needed `acp` / `acpx` config in `~/.openclaw/openclaw.json` (or equivalent current config path)
- run the smallest real OpenClaw ACP turn/session and confirm Claude auth failures now surface as terminal errors instead of false success
- only reopen the old `acpx exited with code 1/5` thread if a fresh repro appears
4. Commit/push/PR the focused upstream reliability branch when ready.
5. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.
## Success criteria
- Real-run verification of the new error/outcome fix. ✅ done for subagent persistence/announcement handling.
- Clear separation between resolved reporting bug(s) and any still-open ACP/runtime failures.
- Explicit decision on whether raw `agent.wait` behavior is acceptable or requires a follow-up fix.
- State files updated with paths, commands, and outcomes.