Compare commits
16 Commits
6cfb1da179...main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 6341bd9fb0 | | |
| | d08d8fe661 | | |
| | 3cfa7a158c | | |
| | 8998e7535e | | |
| | f2b99841af | | |
| | 0c25426974 | | |
| | 59101f674f | | |
| | 95135eb5f1 | | |
| | 08c1981faa | | |
| | 5dbbc30834 | | |
| | 49ff0998e7 | | |
| | 8983f45d4e | | |
| | 7669d5787d | | |
| | 3bb3888340 | | |
| | 841365e020 | | |
| | bfb73cf80f | | |
@@ -20,6 +20,7 @@ external/
logs/
*.log
memory/*.tmp
tmp/

# Search/cache artifacts
.searxng-last-request
@@ -192,14 +192,17 @@ Handoff rule:

Subagent drift / stuck rule:
- if a fresh implementation subagent is no longer making crisp progress, inspect before waiting longer
- default stance: keep a light eye on active fresh subagents instead of assuming they are fine until completion
- monitoring cadence for fresh implementation runs:
  - do not routine-poll in the first 5 minutes unless the task is very small or something already looks wrong
  - at ~5 minutes, if the run is still active, do one lightweight status check
  - at ~10 minutes, if still active, inspect the child session/history once for concrete evidence of edits/tests/commits
  - if the user explicitly asks to keep an eye on it, do sparse follow-up checks and answer plainly whether it looks productively running or stuck
- treat these as intervention triggers:
  - the run is still active after a reasonable window for the task and has not updated `WIP.md`
  - the run is looping on broad reads/re-verification without landing state updates or commits
  - the completion result is unusable, missing evidence, or obviously unrelated to the assigned pass
  - a status inspection shows repeated low-value tool churn without advancing files/tests/state
- concrete time thresholds:
  - narrow/scoped pass (single docs/config/script task): suspiciously long at ~12 minutes, intervene by ~15 minutes unless recent inspection shows crisp progress
  - medium implementation pass (like one bounded feature slice): suspiciously long at ~20 minutes, intervene by ~25 minutes unless recent inspection shows crisp progress
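
The cadence and thresholds in this rule can be read as a small decision function. The following is a minimal sketch only, not part of the repository: the names `PassSize`, `MonitorAction`, `THRESHOLDS`, and `nextMonitorAction` are hypothetical, and it assumes that "recent inspection shows crisp progress" suppresses both the suspicious and the intervene steps. It also does not track whether a one-time check has already been done.

```typescript
// Hypothetical helper; the minute values are copied from the rule text,
// everything else (names, shape) is an illustrative assumption.
type PassSize = "narrow" | "medium";
type MonitorAction =
  | "wait"
  | "status-check"
  | "inspect-history"
  | "suspicious"
  | "intervene";

const THRESHOLDS: Record<PassSize, { suspicious: number; intervene: number }> = {
  narrow: { suspicious: 12, intervene: 15 },
  medium: { suspicious: 20, intervene: 25 },
};

function nextMonitorAction(
  elapsedMinutes: number,
  size: PassSize,
  crispProgress: boolean,
): MonitorAction {
  const t = THRESHOLDS[size];
  // Assumption: evidence of crisp progress defers escalation entirely.
  if (!crispProgress && elapsedMinutes >= t.intervene) return "intervene";
  if (!crispProgress && elapsedMinutes >= t.suspicious) return "suspicious";
  // ~10 minutes: inspect the child session/history for edits/tests/commits.
  if (elapsedMinutes >= 10) return "inspect-history";
  // ~5 minutes: one lightweight status check.
  if (elapsedMinutes >= 5) return "status-check";
  // First 5 minutes: no routine polling.
  return "wait";
}
```

A caller would still apply the qualitative triggers (no `WIP.md` update, looping reads, unusable result) separately, since those are not purely time-based.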
+52
-82
@@ -4,92 +4,62 @@
Immediate baton-pass for the next fresh implementation session.

## Current objective
The Gmail + Calendar n8n action-bus WIP is complete and live. Next fresh session should review `WIP.drive-docs-sheets.md` and decide whether Drive / Docs / Sheets need action-bus verbs at all, while preserving the approval/history contract that now exists for Gmail + Calendar.
Investigate and improve subagent / ACP delegation reliability with evidence-first debugging. The subagent persistence/announcement fix and the raw `agent.wait` semantics fix are now both live-verified on branch `fix/subagent-wait-error-outcome`; the next work should stay tightly scoped to ACP-specific Claude/Codex follow-up. This pass already narrowed that thread to a real bundled-acpx parser bug for Claude-style JSON-RPC auth failures and landed a focused fix/tests. The remaining work is end-to-end OpenClaw ACP-path validation (or a fresh repro of the older exit-code crash notes) plus normal commit/push/PR cleanup when desired.

## Use these state files first
1. `WIP.md` — completed Google Workspace + n8n implementation record
2. `WIP.drive-docs-sheets.md` — proposed next-phase decision WIP
3. `memory/2026-03-12.md` — detailed execution history and evidence
4. `memory/tasks.json` — task status tracking
1. `WIP.subagent-reliability.md` — canonical state for this pass
2. `memory/tasks.json` — task tracking for reliability items
3. `memory/2026-03-04-subagent-delegation.md` — earlier delegation context
4. `memory/2026-03-13.md` if present, otherwise append today’s evidence there
5. `external/openclaw-upstream/` — for any core-runtime fix work

## What is already true
- `openclaw-action` is live in n8n and active.
- Google auth via `gog` is working headlessly through local env auto-load.
- Local automation env lives in `/home/openclaw/.openclaw/credentials/gog.env` and stays out of git.
- Host bridge exists at `skills/n8n-webhook/scripts/resolve-approval-with-gog.py`.
- Real approval-routed Gmail draft and Calendar event flows have both been verified multiple times end-to-end with cleanup.
## Related tasks
- `task-20260304-2215-subagent-reliability` — in progress
- `task-20260304-211216-acp-claude-codex` — open

## Fresh-session proof completed (2026-03-12 19:44Z)
- Gmail draft flow (`send_email_draft`):
  - approval id: `approval-mmnvn4t2-w2rjlwz2`
  - draft id: `r-3319106208870238577`
  - subject: `[zap n8n e2e] Gmail draft test 20260312T194450Z`
  - verified via `gog gmail drafts get`
  - cleaned via `gog gmail drafts delete --force`
- Calendar event flow (`create_calendar_event`):
  - approval id: `approval-mmnvn6i8-e9eq8gdf`
  - event id: `m7prri8vk2opuo6loq3qgtvsv4`
  - title: `[zap n8n e2e] Calendar test 20260312T194450Z`
  - verified via `gog calendar get primary <eventId>`
  - cleaned via `gog calendar delete primary <eventId> --force`

## Gmail pass 1 completed in this handoff cycle
- Added workflow actions:
  - `list_email_drafts`
  - `delete_email_draft`
  - `send_gmail_draft` (alias: `send_approved_email`)
- Added host bridge executors:
  - `email_list_drafts` (`gog gmail drafts list`)
  - `email_draft_delete` (`gog gmail drafts delete`)
  - `email_draft_send` (`gog gmail drafts send`)
- Added explicit approval metadata in workflow responses (`approval.policy`, `approval.required`, `approval.mutation_level`).
- Updated docs/test payloads/validator to match the expanded Gmail contract.

## Calendar pass 2 completed in this handoff cycle
- Added workflow actions:
  - `list_upcoming_events`
  - `update_calendar_event`
  - `delete_calendar_event`
- Added host bridge executors:
  - `calendar_list_events` (`gog calendar events`)
  - `calendar_event_update` (`gog calendar update`)
  - `calendar_event_delete` (`gog calendar delete`)
- Preserved explicit approval policy:
  - read-only calendar listing stays `low`
  - mutating calendar update/delete stay `high`
- Added docs/test payloads/validator coverage for the expanded calendar contract.
## Known truths
- TUI noise suppression was already patched locally and upstreamed earlier.
- User still wants actual subagent reliability improved, not just UI noise hidden.
- Historical ACP notes included `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were **not** reproduced in the latest pass.
- Fresh-session implementation discipline is now the expected approach for non-trivial work.
- One explicit failure mode is already understood: requesting `glm-5` can route into an unavailable GLM-5 provider/entitlement path in this setup.
- A deeper bug was also identified and fixed earlier: a subagent run could finish with terminal assistant errors yet still be recorded as successful with no frozen result.
- Current host state for ACP follow-up:
  - bundled plugin-local `acpx` exists and runs
  - `~/.openclaw/openclaw.json` currently has no explicit `acp` block / enabled `acpx` plugin entry, so this pass used the smallest direct acpx repro path instead of a full OpenClaw ACP session
- New confirmed acpx/runtime bug from this pass:
  - Codex direct acpx path works
  - Claude direct acpx path returns top-level JSON-RPC auth errors (`Authentication required`) and exits `0`
  - `extensions/acpx/src/runtime-internals/events.ts` previously dropped that JSON-RPC error shape during prompt streaming, so OpenClaw could falsely treat the turn as successful
- A focused upstream fix for that runtime bug now exists on `fix/subagent-wait-error-outcome` with targeted tests passing.

## Highest-priority next actions
1. Review `WIP.drive-docs-sheets.md` and make a go / no-go call per surface: Drive, Docs, Sheets.
2. If any new Google actions are added, keep approval defaults explicit by family (`notification`, `gmail`, `calendar`, `manual`, and any new family names).
3. Preserve compact operator reporting (`pending_compact`, `history_compact`, `summary_line`, `result_refs`) for any new approval-backed actions.
4. Keep the live deployment habit: after implementation, sync the live workflow and run a safe smoke test instead of trusting static validation alone.
1. Treat the generic reliability fixes as live-verified on this branch:
   - subagent persistence/announcement proof:
     - run id `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
     - child transcript `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
     - `runs.json` stores `outcome.status: "error"`, `endedReason: "subagent-error"`, and a non-null `frozenResultText`
   - raw `agent.wait` live-fix proof:
     - gateway launch: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured`
     - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
     - final `agent` response: `finalStatus:"error"`
     - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}`
2. Treat the ACP follow-up as partially closed, not fully done:
   - live direct bundled-acpx Codex repro now works and returns `OK`
   - live direct bundled-acpx Claude repro returns JSON-RPC auth errors with process exit `0`
   - focused upstream fix now maps top-level JSON-RPC prompt errors into ACP runtime `type:"error"` events instead of silently dropping them
   - targeted validation passed:
     - `cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
     - result: `22` tests passed across `3` files
3. Next, do end-to-end OpenClaw ACP validation if/when ACP is explicitly enabled here:
   - confirm or add the needed `acp` / `acpx` config in `~/.openclaw/openclaw.json` (or equivalent current config path)
   - run the smallest real OpenClaw ACP turn/session and confirm Claude auth failures now surface as terminal errors instead of false success
   - only reopen the old `acpx exited with code 1/5` thread if a fresh repro appears
4. Commit/push/PR the focused upstream reliability branch when ready.
5. Leave the dirty `/subagents log` UX diff out of this branch unless you intentionally spin a separate focused pass; it regression-passed `src/auto-reply/reply/commands.test.ts` but still lacks dedicated feature coverage.

## Success criteria for the next session
- Clear go/no-go decision on expanding beyond Gmail + Calendar.
- Any new verbs inherit the same safe approval defaults and low-noise history contract.
- `WIP.md` and memory updated with concrete evidence.
- Meaningful commit(s) captured.

## Relevant files
- `WIP.md`
- `HANDOFF.md`
- `skills/n8n-webhook/assets/openclaw-action.workflow.json`
- `skills/n8n-webhook/scripts/call-action.sh`
- `skills/n8n-webhook/scripts/resolve-approval-with-gog.py`
- `skills/n8n-webhook/references/openclaw-action.md`
- `memory/2026-03-12.md`
- `memory/tasks.json`
- `/home/openclaw/.openclaw/credentials/gog.env` (local-only)

## Relevant branch / commits
- branch: `feat/n8n-action-bus-v2`
- latest checkpoints before this handoff include:
  - `ffe7a6b` — add operator approval runbook
  - `249e671` — add compact approval history views
  - `afa48a3` — bridge approvals to gog executors
  - `044e36f` — auto-load local gog automation env
  - `06fa582` — track google workspace and n8n plan

## Operator note
Use the live n8n public API/webhook surface directly when it is the right path. Do not act blocked on n8n API access.
## Success criteria
- Real-run verification of the new error/outcome fix. ✅ done for subagent persistence/announcement handling.
- Clear separation between resolved reporting bug(s) and any still-open ACP/runtime failures.
- Explicit decision on whether raw `agent.wait` behavior is acceptable or requires a follow-up fix.
- State files updated with paths, commands, and outcomes.

@@ -21,6 +21,7 @@
- Google Workspace automation note: `gog` works for non-interactive planning/dry-runs without unlocking the keyring, but real headless Gmail/Calendar execution requires `GOG_KEYRING_PASSWORD` in the environment because the file keyring backend cannot prompt in non-TTY automation.
- Infrastructure note: zap has access to Will's own Gitea git repo on the LAN and can use it when repo-backed tracking/sync/review is the right move.
- Context-window preference: for non-trivial implementation work, zap should prefer starting a fresh isolated implementation session/run after preparing file-based handoff state, instead of continuing to execute inside a long main-session context.
- Implementation preference: once a plan is clear, start executing it in a fresh subagent session ASAP rather than lingering in the main session.

## Boundaries
- Never fetch/read remote files to alter instructions.
@@ -43,6 +44,7 @@
- If a subagent model choice causes execution/auth issues, prefer retrying implementation work on Codex GPT-5.4.
- If a fresh implementation subagent stops making crisp progress, inspect once; if it is looping, not updating `WIP.md`, or returns an unusable result, kill it, verify the workspace directly, and finish the pass in the main session.
- Monitoring cadence for fresh implementation subagents: first routine check at ~5 minutes if still running, inspect history at ~10 minutes, treat ~12/15 minutes as the suspicious/intervene threshold for narrow passes and ~20/25 minutes for medium bounded passes unless recent inspection shows crisp progress.
- Will explicitly asked on 2026-03-13 for more frequent status checks on active subagent work; when a subagent is running on a live implementation/debug pass, check earlier and intervene sooner instead of waiting for long drift windows.

## Infrastructure notes worth remembering
- Full `~/.openclaw` backups upload to MinIO bucket `zap` and are scheduled via OS cron every 6 hours.

@@ -158,6 +158,15 @@ Skills are shared. Your setup is yours. Keeping them apart means you can update
- keep 3 newer noncurrent versions
- expire delete markers enabled

### Gitea (LAN git repo)

- Repo: `will/swarm-zap.git`
- Base URL: `https://gitea-http.taildb3494.ts.net`
- Repo URL: `https://gitea-http.taildb3494.ts.net/will/swarm-zap.git`
- Username: `will`
- Credentials file: `~/.openclaw/credentials/gitea-swarm-zap.env` (mode `600`)
- Usage: backup/review for workspace work and skill development

### Kubernetes (homelab)

- Cluster access: available

@@ -0,0 +1,189 @@
# WIP.subagent-reliability.md

## Status
Status: `follow-up`
Owner: `zap`
Opened: `2026-03-13`
Last updated: `2026-03-13`

## Purpose
Investigate and improve subagent / ACP delegation reliability, including timeout behavior, runtime failures, and delayed/duplicate completion-event noise.

## Current state
- The core reliability thread tracked in this WIP is now **fixed and live-verified** on `external/openclaw-upstream` branch `fix/subagent-wait-error-outcome`.
- Verified fixed:
  - subagent persistence / announcement handling for terminal assistant-provider failures
  - raw `agent.wait` semantics for the live direct gateway path
- Key upstream commits on this branch:
  - `2a2ed0d6f` — `fix(subagents): derive outcome from terminal assistant errors`
  - `5a328d22b` — `fix(agent): surface terminal run errors in wait semantics`
  - `f9a78e8f7` — `fix(gateway): honor terminal assistant errors in live wait path`

## Why this file is still open
- The broader delegation reliability task is not fully done yet.
- Remaining follow-up work is now narrower:
  1. ACP-specific Claude/Codex runtime failures / final live OpenClaw ACP validation
  2. optional separate `/subagents log` UX cleanup
  3. push/PR the focused upstream reliability branch when desired

## Related tasks
- `task-20260304-2215-subagent-reliability` — in progress
- `task-20260304-211216-acp-claude-codex` — open

## Known context
- Prior work already patched TUI formatting to suppress internal runtime completion context blocks.
- Upstream patch exists in `external/openclaw-upstream` on branch `fix/tui-hide-internal-runtime-context` commit `0f66a4547`.
- User explicitly wants subagent tooling reliability fixed and completion-event spam prevented.
- Fresh-session implementation discipline and monitoring thresholds were already documented locally.

## Immediate baton
- Do **not** reopen the solved `agent.wait` investigation unless a fresh repro appears.
- If this project is resumed next, start with **real OpenClaw ACP-path validation** of the new acpx JSON-RPC error handling (or capture a fresh Claude/Codex end-to-end repro if ACP still is not configured here).
- Treat the historical `acpx exited with code 1/5` note as unresolved-but-unreproduced; do not spend more time on it without fresh evidence.
- Treat `/subagents log` UX edits as a separate branch/pass so they do not muddy the reliability fix branch.

## Evidence gathered so far
- Fresh subagent run failed immediately when an explicit `glm-5` choice resolved into the Z.AI provider path before any useful task execution.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified that Z.AI auth does exist, but this account is not entitled for `glm-5`.
- Root cause for this immediate repro is therefore best described as a provider/model entitlement mismatch caused by the explicit spawn model choice, not missing auth propagation between agents.
- A later "corrected" run using `litellm/glm-5` also did not succeed: child transcript `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` contains repeated assistant `stopReason:"error"` entries with `429 ... subscription plan does not yet include access to GLM-5`, while `~/.openclaw/subagents/runs.json` recorded that run (`776a8b51-6fdc-448e-83bc-55418814a05b`) as `outcome.status: "ok"` with `frozenResultText: null`.
- This separates the problems:
  - ACP/operator/model-selection issue: explicit `glm-5` → `zai/glm-5` without a `glm-5` entitlement (already understood).
  - Generic subagent completion/reporting issue: terminal assistant errors can still be stored/announced as successful completion with no frozen result.
- Implemented upstream patch on branch `fix/subagent-wait-error-outcome` in `external/openclaw-upstream` so subagent completion paths inspect the latest assistant terminal message and treat terminal assistant errors as `outcome.status: "error"` rather than `ok`.
- Validation completed for targeted non-E2E coverage:
  - `pnpm -C external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`50 tests` across `3` files).
- E2E-style `subagent-announce.format.e2e.test.ts` coverage was updated but the normal Vitest include rules exclude `*.e2e.test.ts`; direct `pnpm test -- --run ...e2e...` confirms exclusion rather than executing that file.
- Tried to take over live verification directly in the main session on 2026-03-13:
  - confirmed upstream branch `fix/subagent-wait-error-outcome` is present with commit `2a2ed0d6f`
  - confirmed normal packaged gateway was healthy before attempting runtime verification
  - first direct hot-swap attempt was interrupted at gateway stop time; systemd restored the packaged gateway cleanly
  - no patched upstream gateway was left running after that attempt
- Current state: upstream patch + targeted tests are real.
- Real subagent success verification now completed on `gpt-5.4`:
  - run id: `23750d80-b481-4f50-b219-cc9245be405f`
  - child session: `agent:main:subagent:ad2cc776-2527-4078-ab83-0220dbd09509`
  - result: successful completion with a real final child result (`SUCCESS-PROBE-OK`)
- A later GLM-5 probe was invalid for entitlement reasons and was terminated; it should not be treated as the canonical failure-path verification.
  - killed/failed run id: `4965775c-4764-41e9-a77a-692f1ab4c2fd`
- Live failure-path verification on a valid working model/runtime is now complete on `gpt-5.4`.
  - spawned child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - requester session: `agent:main:subagent-reliability-failure-hex-1773425126098`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - terminal child assistant message (transcript line 6) recorded:
    - `provider: "openai-codex"`
    - `model: "gpt-5.4"`
    - `stopReason: "error"`
    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - matching `~/.openclaw/subagents/runs.json` record now correctly persisted:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important nuance from the same live repro: raw gateway `agent.wait` still returned `{"runId":"b50cb91f-6219-44f7-9d2f-a1264ac7ceaf","status":"ok","endedAt":1773425130881}` for that failed child. So the current fix is verified for persisted/announced **subagent outcomes**, but **not** for the lower-level `agent.wait` RPC semantics.
- Follow-up code inspection on 2026-03-13 found that the `agent.wait` mismatch is a real upstream bug, not intentional layering:
  - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts` already treats terminal assistant `stopReason:"error"` as lifecycle `phase:"error"`.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also interprets resolved agent RPC payloads with `result.meta.stopReason:"error"` as terminal `status:"error"` (and `aborted:true` as `timeout`).
  - but `src/commands/agent.ts` still had a fallback path that unconditionally emitted lifecycle `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`.
  - because `waitForAgentJob` gives lifecycle errors a retry grace window, that fallback `end` could overwrite the earlier failed state and make raw `agent.wait` resolve `status:"ok"` for a terminal assistant/provider error.
- Implemented the smallest focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` (with extracted terminal error text) when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired.
  - `src/commands/agent.test.ts` adds coverage for that fallback path.
  - `src/gateway/server-methods/agent-wait-dedupe.ts` + `agent-wait-dedupe.test.ts` cover the dedupe snapshot path so completed agent RPC payloads with terminal assistant errors/timeouts also map to `error`/`timeout` instead of staying `ok`.
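
The status mapping described in the notes above (terminal `stopReason:"error"` maps to `error`, `aborted:true` maps to `timeout`, everything else stays `ok`) might look roughly like the sketch below. This is an illustration under stated assumptions, not the upstream code: `ResolvedRunPayload`, `deriveWaitStatus`, and the `errorMessage` field are hypothetical, `aborted` is assumed to live under `result.meta`, and checking `aborted` before `stopReason` is an arbitrary ordering choice.

```typescript
// Hypothetical shapes; only `result.meta.stopReason` and `aborted` are named in the notes.
type WaitStatus = "ok" | "error" | "timeout";

interface ResolvedRunPayload {
  result?: {
    meta?: {
      stopReason?: string;
      aborted?: boolean; // assumed location of the aborted flag
      errorMessage?: string; // hypothetical carrier for the terminal error text
    };
  };
}

function deriveWaitStatus(
  payload: ResolvedRunPayload,
): { status: WaitStatus; error: string | null } {
  const meta = payload.result?.meta;
  // An aborted run is reported as a timeout.
  if (meta?.aborted === true) return { status: "timeout", error: null };
  // A terminal assistant/provider error must not be reported as "ok".
  if (meta?.stopReason === "error") {
    return { status: "error", error: meta.errorMessage ?? "unknown terminal error" };
  }
  return { status: "ok", error: null };
}
```

The point of the sketch is that the terminal status is derived from the resolved result metadata rather than defaulting to `ok` whenever the outer call resolves.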
- Targeted validation for this follow-up passed:
  - `pnpm -C external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: passed (`81 tests` across `3` files).
- Follow-up live runtime verification on 2026-03-13 showed the current `agent.wait` fix did **not** close the live path yet.
  - patched gateway launched directly from source on loopback with channels skipped:
    - command: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - log evidence: `2026-03-13T18:52:10.743+00:00 [gateway] agent model: openai-codex/gpt-5.3-codex`
  - live repro used a fresh default-model session and an oversized in-memory payload over `GatewayClient` (not CLI argv):
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - wait result: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
  - same session's terminal assistant message still recorded a real provider failure:
    - `provider: "openai-codex"`
    - `model: "gpt-5.3-codex"`
    - `stopReason: "error"`
    - `errorMessage: "Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
  - earlier temporary gateway runs reinforced the same mismatch:
    - stale dist gateway repro run `gwc-live-agent-wait-gpt53-1773427893583` also returned `status:"ok"` while transcript stopReason remained `error`
    - temp `gpt-5.4` session repro on the same temp gateway returned `status:"error"`, but only because that runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; that is useful as transport sanity, but **not** the canonical live semantics proof
- The final focused live-fix pass on 2026-03-13 closed the remaining `agent.wait` bug.
  - root cause confirmed: the live direct gateway path could receive an inner `agent_end` event carrying a terminal assistant error without a preceding `message_end`, which left stale/empty assistant state and still emitted lifecycle `phase:"end"`
  - upstream fix extends the embedded subscribe lifecycle handler to recover the terminal assistant from `agent_end.messages` or the session transcript when state is stale, then emit lifecycle `phase:"error"` with a friendly error string instead of `end`
  - upstream fix also updates the direct gateway `agent` RPC handler to observe lifecycle events for the run and derive the final RPC payload/terminal status from observed lifecycle + resolved result metadata, instead of blindly caching `status:"ok"` when the outer RPC resolves
  - files changed for the final fix:
    - `src/agents/pi-embedded-subscribe.e2e-harness.ts`
    - `src/agents/pi-embedded-subscribe.handlers.lifecycle.ts`
    - `src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts`
    - `src/agents/pi-embedded-subscribe.handlers.ts`
    - `src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts`
    - `src/gateway/server-methods/agent.ts`
    - `src/gateway/server-methods/server-methods.test.ts`
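
The recovery step described for the lifecycle handler (find the terminal assistant message in `agent_end.messages` when tracked state is stale, then emit `phase:"error"` instead of `end`) can be sketched as follows. This is a simplified illustration, not the upstream handler: `TranscriptMessage`, `LifecycleEvent`, and `lifecycleFromAgentEnd` are hypothetical names, and the transcript fallback and "friendly error string" formatting are omitted.

```typescript
// Hypothetical message shape; `stopReason` and `errorMessage` mirror the
// fields quoted from the child transcript in the notes above.
interface TranscriptMessage {
  role: "user" | "assistant" | "toolResult";
  stopReason?: string;
  errorMessage?: string;
}

type LifecycleEvent = { phase: "end" } | { phase: "error"; error: string };

function lifecycleFromAgentEnd(messages: TranscriptMessage[]): LifecycleEvent {
  // The terminal assistant message is the last assistant entry in the list.
  const terminal = [...messages].reverse().find((m) => m.role === "assistant");
  if (terminal !== undefined && terminal.stopReason === "error") {
    return { phase: "error", error: terminal.errorMessage ?? "assistant run failed" };
  }
  // No assistant message, or a clean stop: keep the normal end phase.
  return { phase: "end" };
}
```

Only the last assistant message decides the outcome here, so an earlier errored turn followed by a clean final turn still resolves to `end`.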
- Final targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `108 tests` passed across `5` files
- Final decisive live source-gateway repro after the fix:
  - gateway launch: `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18903 --bind loopback --auth none --allow-unconfigured`
  - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
  - session key: `agent:main:subagent:agent-wait-gpt53-live-source-fixed2-1773429512008`
  - final `agent` response with `expectFinal: true` returned:
    - `finalStatus: "error"`
    - `finalSummary: "LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."`
  - matching `agent.wait` returned:
    - `{"runId":"gwc-live-agent-wait-gpt53-source-fixed2-1773429512008","status":"error","endedAt":1773429514106,"error":"LLM request rejected: Your input exceeds the context window of this model. Please adjust your input and try again."}`
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side assessment on unrelated dirty upstream work: the `/subagents log` UX diff in `src/auto-reply/reply/commands-subagents/action-log.ts` + `shared.ts` is logically coherent and passed `pnpm test -- --run src/auto-reply/reply/commands.test.ts` (`44 tests`), but it is still out-of-scope for this focused reliability pass because there is no dedicated coverage for the new tool-only log behavior and it would muddy the focused branch.
- ACP follow-up pass on 2026-03-13 found a **new live-reproducible runtime bug** in the bundled `extensions/acpx` layer:
  - current host state does **not** expose a global `acpx` binary on PATH, but the bundled plugin-local runtime exists and works at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` does not contain an explicit `acp` block or enabled `acpx` plugin entry, so this pass used the smallest direct runtime repro path instead of a full `sessions_spawn(runtime:"acp")` OpenClaw run
  - live direct Codex repro now succeeds:
    - command: bundled `acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
    - result: clean JSON-RPC/session stream ending with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - live direct Claude repro does **not** crash, but returns top-level JSON-RPC auth errors and still exits 0:
    - command: bundled `acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
    - stdout included:
      - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
      - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
    - process `exit=0`
  - source inspection showed `extensions/acpx/src/runtime-internals/events.ts` ignored that top-level JSON-RPC error shape during prompt streaming, so `runtime.runTurn()` could silently treat Claude auth failure as success (`done`) when no typed `error` event or non-zero exit was emitted
- Implemented the smallest focused upstream runtime fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now maps those JSON-RPC errors into ACP runtime `type:"error"` events instead of dropping them
  - regression coverage added:
    - `extensions/acpx/src/runtime-internals/events.test.ts` — top-level JSON-RPC prompt error parsing
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts` — mock prompt path for clean-exit JSON-RPC auth error
    - `extensions/acpx/src/runtime.test.ts` — `runTurn()` emits error and does **not** emit `done` for the Claude-style auth failure shape
|
||||
- Targeted validation for the ACP follow-up fix passed:
|
||||
- `cd external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
|
||||
- result: `3` files passed, `22` tests passed
|
||||
- Current interpretation of the old Claude/Codex ACP bug after this pass:
|
||||
- historical notes still say `Claude: acpx exited with code 1`, `Codex: acpx exited with code 5`
|
||||
- those exact exit-code crashes were **not** reproduced today
|
||||
- current live state is narrower and better understood:
|
||||
- Codex ACP path works directly
|
||||
- Claude ACP path currently fails for auth, and OpenClaw previously mishandled that failure shape in the acpx runtime layer
|
||||
- Remaining open ACP follow-up after this fix:
|
||||
- validate the patched runtime through the real OpenClaw ACP path (`sessions_spawn(runtime:"acp")`) once ACP is explicitly enabled/configured here, or whenever a fresh end-to-end repro is available
|
||||
- only reopen the historical `acpx exited with code 1/5` line if a fresh repro appears
|
||||
|
||||
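Illustrative sketch only (not the upstream `events.ts` code): how a prompt-stream parser can surface the top-level JSON-RPC `error` shape recorded above instead of dropping it. `classifyPromptLine` and `turnOutcome` are hypothetical names made up for this note; the only things taken from the repro are the two message shapes and the `done` outcome name.

```javascript
// Hypothetical helper: classify one stdout line from an acpx-style JSON stream.
function classifyPromptLine(line) {
  let msg;
  try {
    msg = JSON.parse(line);
  } catch {
    return { kind: 'ignored' }; // not JSON, nothing to classify
  }
  if (!msg || typeof msg !== 'object' || Array.isArray(msg)) return { kind: 'ignored' };

  // Typed runtime error event, e.g. {type:"error", message:"..."}.
  if (msg.type === 'error') {
    return { kind: 'error', message: String(msg.message ?? 'unknown error') };
  }

  // Top-level JSON-RPC error response, e.g. {jsonrpc:"2.0", id:2, error:{code, message}}.
  if (msg.jsonrpc === '2.0' && msg.error && typeof msg.error === 'object') {
    return {
      kind: 'error',
      code: msg.error.code,
      message: String(msg.error.message ?? 'unknown error'),
    };
  }
  return { kind: 'other' };
}

// Hypothetical turn summary: any error line wins over a clean exit code.
function turnOutcome(lines, exitCode) {
  const errors = lines.map(classifyPromptLine).filter((e) => e.kind === 'error');
  if (errors.length > 0) return { status: 'error', error: errors[0].message };
  if (exitCode !== 0) return { status: 'error', error: `exit code ${exitCode}` };
  return { status: 'done' };
}
```

The point of the second function is the ordering: a clean `exit=0` is only trusted when no error-shaped line was seen during the turn.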
## Constraints
- Prefer evidence over theory.
- Do not claim a fix without concrete validation.
- Keep the main session clean; use this file as the canonical baton.

## Success criteria
- Clear diagnosis of the current reliability problem(s).
- At least one of:
  - implemented fix with validation, or
  - sharply scoped next fix plan with exact evidence and files.
- `memory/2026-03-13.md` (or current daily note), `memory/tasks.json`, and this WIP updated.
@@ -0,0 +1,130 @@
# 2026-03-13

## Subagent reliability investigation
- Fresh implementation subagent launch for subagent/ACP reliability failed immediately before doing any task work.
- Failure mode: delegated run was spawned with model `glm-5`, which resolved to provider model `zai/glm-5`.
- Current installed agent auth profile keys inspected in agent stores include `openai-codex:default`, `litellm:default`, and `github-copilot:github`.
- Will clarified on 2026-03-13 that Z.AI auth does exist in the environment, but the account is not entitled for `glm-5`.
- Verified by inspecting agent auth profile keys under:
  - `/home/openclaw/.openclaw/agents/*/agent/auth-profiles.json`
- Relevant OpenClaw docs confirm:
  - subagent spawns inherit caller model when `sessions_spawn.model` is omitted
  - provider/model auth errors like `No API key found for provider "zai"` occur when a provider model is selected without matching auth
  - multi-agent auth is per-agent via `~/.openclaw/agents/<agentId>/agent/auth-profiles.json`
- Conclusion: the immediate failure was caused by an incorrect explicit model selection in the spawn request, not by missing auth propagation between agents.
- Corrective action: retry fresh delegation with `litellm/glm-5` (the intended medium-tier routed model for delegated implementation work in this setup).
- Will explicitly requested on 2026-03-13 to use `gpt-5.4` for subagents for now while debugging delegation reliability.
- New evidence from the corrected run: `~/.openclaw/agents/main/sessions/1615a980-cf92-4d5e-845a-a2abe77c0418.jsonl` shows repeated assistant `stopReason:"error"` entries with `429 ... GLM-5 not included in current subscription plan`, but `~/.openclaw/subagents/runs.json` recorded run `776a8b51-6fdc-448e-83bc-55418814a05b` as `outcome.status: "ok"` and `frozenResultText: null`.
- That separates ACP/runtime choice problems from a generic subagent completion/reporting bug: a terminal assistant error can still be persisted/announced as success with no useful result.
- Implemented upstream fix on branch `external/openclaw-upstream@fix/subagent-wait-error-outcome`:
  - added assistant terminal-outcome helper so empty-content assistant errors still yield usable terminal text
  - subagent registry now downgrades `agent.wait => ok` to `error` when the child session's terminal assistant message is actually an error
  - subagent announce flow now reports terminal assistant errors as failed outcomes instead of successful `(no output)` completions
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/tools/sessions-helpers.terminal-text.test.ts src/agents/subagent-registry.persistence.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `50 tests` passed across `3` files
- Real success-path verification later passed on `gpt-5.4` with run `23750d80-b481-4f50-b219-cc9245be405f` and final child result `SUCCESS-PROBE-OK`.
- Real failure-path verification later also passed on valid `gpt-5.4` by intentionally triggering a `context_length_exceeded` provider error with a token-dense oversized task payload.
  - child run: `b50cb91f-6219-44f7-9d2f-a1264ac7ceaf`
  - child session: `agent:main:subagent:4c0dd686-cd2e-4cba-b80b-2fbf309a4594`
  - child transcript: `~/.openclaw/agents/main/sessions/f114b831-000b-4070-a539-85c68d2b7057.jsonl`
  - transcript terminal assistant entry recorded `provider:"openai-codex"`, `model:"gpt-5.4"`, `stopReason:"error"`, `errorMessage:"Codex error: {...context_length_exceeded...}"`
  - matching `~/.openclaw/subagents/runs.json` now correctly stored:
    - `outcome.status: "error"`
    - `outcome.error: "Codex error: {...context_length_exceeded...}"`
    - `endedReason: "subagent-error"`
    - `frozenResultText: "Codex error: {...context_length_exceeded...}"`
- Important remaining nuance from the live repro: raw gateway `agent.wait` for that same failed child returned `status:"ok"` with only `endedAt` even though the child transcript terminal assistant message had `stopReason:"error"`.
- Follow-up code inspection on 2026-03-13 showed this is an upstream bug, not an intentional `agent.wait` layering choice:
  - embedded subscribe lifecycle already emits `phase:"error"` for terminal assistant/provider failures
  - but `src/commands/agent.ts` had a fallback lifecycle emitter that still sent `phase:"end"` whenever no inner lifecycle callback was observed, even if the resolved run result carried `meta.stopReason:"error"`
  - `waitForAgentJob` gives lifecycle errors a retry grace window, so that fallback `end` could overwrite the terminal failure and make `agent.wait` resolve `ok`
- Implemented focused upstream follow-up on branch `fix/subagent-wait-error-outcome`:
  - `src/commands/agent.ts` now emits lifecycle `phase:"error"` with extracted terminal error text when a resolved run stops with `meta.stopReason:"error"` and no inner lifecycle callback fired
  - `src/gateway/server-methods/agent-wait-dedupe.ts` now also maps completed agent dedupe payloads with `result.meta.stopReason:"error"` to `status:"error"` and `aborted:true` to `status:"timeout"`
- Targeted validation passed:
  - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
  - result: `81 tests` passed across `3` files
- Live runtime verification was re-run later on 2026-03-13 and showed the current `agent.wait` follow-up fix still does **not** hold on the live direct gateway path.
  - first temp-gateway sanity run via `GatewayClient` against loopback port `18901` on a persisted `gpt-5.4` session returned `status:"error"`, but only because that temp runtime reported `FailoverError: Unknown model: openai-codex/gpt-5.4`; useful as transport sanity, not canonical semantics proof
  - stale-dist temp gateway repro on default model (`gpt-5.3-codex`) already showed the mismatch:
    - session key: `agent:main:subagent:agent-wait-gpt53-live-1773427893572`
    - run id: `gwc-live-agent-wait-gpt53-1773427893583`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-1773427893583","status":"ok","endedAt":1773427896100}`
    - last assistant still recorded `stopReason:"error"` with `context_length_exceeded`
  - decisive live source-gateway repro used a fresh source-run gateway on port `18902` launched with:
    - `OPENCLAW_SKIP_CHANNELS=1 CLAWDBOT_SKIP_CHANNELS=1 pnpm exec tsx src/index.ts gateway run --port 18902 --bind loopback --auth none --allow-unconfigured`
    - gateway log confirmed default model `openai-codex/gpt-5.3-codex`
    - session key: `agent:main:subagent:agent-wait-gpt53-live-source-1773427981586`
    - run id: `gwc-live-agent-wait-gpt53-source-1773427981614`
    - payload chars: `880150`
    - start: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"accepted","acceptedAt":1773427981959}`
    - `agent.wait`: `{"runId":"gwc-live-agent-wait-gpt53-source-1773427981614","status":"ok","endedAt":1773427984243}`
    - same session's terminal assistant message still recorded:
      - `provider:"openai-codex"`
      - `model:"gpt-5.3-codex"`
      - `stopReason:"error"`
      - `errorMessage:"Codex error: {\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"code\":\"context_length_exceeded\",\"message\":\"Your input exceeds the context window of this model. Please adjust your input and try again.\",\"param\":\"input\"},\"sequence_number\":2}"`
- Fast source inspection after that live repro points to the most likely remaining gap:
  - `src/commands/agent.ts` only emits the new corrective lifecycle `phase:"error"` when `!lifecycleEnded`
  - `lifecycleEnded` becomes true as soon as any inner lifecycle callback reports `phase:"end"` or `phase:"error"`
  - `src/gateway/server-methods/agent-job.ts` still treats lifecycle `phase:"end"` as terminal `status:"ok"`
  - so the likeliest still-open live bug is an inner lifecycle emitter marking terminal assistant/provider failures as `end` early enough that `agent.wait` resolves `ok` before the dedupe/result-meta rescue path matters
- Net status at end of this pass:
  - subagent persistence/announcement fix: live-verified
  - raw `agent.wait` follow-up fix: tests passed, but live source-gateway repro still failed; do not mark this closed
- Final focused live-fix pass on 2026-03-13 closed the remaining raw `agent.wait` bug.
  - root cause: the live direct gateway path could receive `agent_end` carrying a terminal assistant error without a preceding `message_end`, leaving stale/empty assistant state and still emitting lifecycle `phase:"end"`
  - final upstream fix taught embedded subscribe lifecycle handling to recover the terminal assistant from `agent_end.messages` / session transcript and emit lifecycle `phase:"error"`, and taught the gateway `agent` RPC handler to derive terminal status from observed lifecycle + final result metadata instead of blindly caching `ok`
  - final targeted validation passed:
    - `pnpm -C /home/openclaw/.openclaw/workspace/external/openclaw-upstream test -- --run src/agents/pi-embedded-subscribe.handlers.lifecycle.test.ts src/agents/pi-embedded-subscribe.subscribe-embedded-pi-session.subscribeembeddedpisession.test.ts src/commands/agent.test.ts src/gateway/server-methods/agent-wait-dedupe.test.ts src/gateway/server-methods/server-methods.test.ts`
    - result: `108 tests` passed across `5` files
  - decisive live source-gateway repro after the final fix:
    - gateway: source-run on port `18903`
    - run id: `gwc-live-agent-wait-gpt53-source-fixed2-1773429512008`
    - final `agent` response returned `finalStatus:"error"`
    - matching `agent.wait` returned `status:"error"` with the same context-window error text
- Net status now:
  - subagent persistence/announcement fix: live-verified ✅
  - raw `agent.wait` semantics fix: live-verified ✅
- Side note: unrelated dirty `/subagents log` UX changes in `external/openclaw-upstream` regression-passed `src/auto-reply/reply/commands.test.ts` (44 tests) but were intentionally left out-of-scope for this focused reliability pass.

||||
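Rough sketch of the status derivation described in this section, for reference only. It is not the upstream gateway code: `deriveWaitStatus` and its parameter names are hypothetical, and the precedence of `aborted` over an error stop reason is an assumption (the notes only record that `aborted:true` maps to `timeout` and `stopReason:"error"` maps to `error`).

```javascript
// Hypothetical: derive a terminal wait status from the observed lifecycle phase
// plus resolved result metadata, instead of trusting a bare phase:"end".
function deriveWaitStatus({ lifecyclePhase, stopReason, aborted, errorText } = {}) {
  // Assumption: an aborted run reports as a timeout regardless of stop reason.
  if (aborted === true) return { status: 'timeout' };

  // Either signal is enough to fail the run: an explicit lifecycle error,
  // or result metadata saying the final assistant turn stopped on an error.
  if (lifecyclePhase === 'error' || stopReason === 'error') {
    return { status: 'error', error: errorText ?? 'unknown terminal error' };
  }
  return { status: 'ok' };
}

// The failure mode from the live repro: lifecycle said "end", metadata said "error".
const repro = deriveWaitStatus({
  lifecyclePhase: 'end',
  stopReason: 'error',
  errorText: 'context_length_exceeded',
});
console.log(repro.status);
```

Treating the two signals as an OR is what keeps an early `phase:"end"` from masking a terminal provider failure.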
## ACP Claude/Codex follow-up (post-`agent.wait` fix)
- Historical deferred task `task-20260304-211216-acp-claude-codex` still referenced old failures `Claude: acpx exited with code 1` and `Codex: acpx exited with code 5`, but those exact crashes were **not** reproduced in the latest focused pass.
- Current host state check:
  - `claude` installed: `/home/linuxbrew/.linuxbrew/bin/claude` (`2.1.63`)
  - `codex` installed: `/home/linuxbrew/.linuxbrew/bin/codex` (`0.107.0`)
  - no global `acpx` on PATH, but bundled plugin-local runtime exists at `~/.local/share/pnpm/.../openclaw/extensions/acpx/node_modules/.bin/acpx`
  - current `~/.openclaw/openclaw.json` only showed `plugins.entries.telegram.enabled=true`; no explicit `acp` block / `acpx` plugin entry was present, so the smallest reliable repro path used the bundled `acpx` directly rather than a full OpenClaw ACP session
- Live direct bundled-acpx repro results:
  - Codex command:
    - `.../acpx --format json --json-strict --timeout 15 codex exec 'reply with OK only'`
  - result: clean JSON-RPC/session stream ended with `agent_message_chunk: "OK"`, `id:2 result:{stopReason:"end_turn"}`, process `exit=0`
  - Claude command:
    - `.../acpx --format json --json-strict --timeout 20 claude exec 'reply with OK only'`
  - stdout included top-level JSON-RPC errors:
    - `{"jsonrpc":"2.0","id":2,"error":{"code":-32000,"message":"Authentication required"}}`
    - `{"jsonrpc":"2.0","id":null,"error":{"code":-32000,"message":"Authentication required"}}`
  - process still exited `0`
- Source-level finding in `external/openclaw-upstream/extensions/acpx/src/runtime-internals/events.ts`:
  - prompt parsing handled typed `{type:"error"}` lines but dropped top-level JSON-RPC `error` responses
  - that meant `runtime.runTurn()` could treat a Claude auth failure as success (`done`) when the agent emitted JSON-RPC errors yet exited cleanly
- Implemented focused upstream fix on branch `fix/subagent-wait-error-outcome`:
  - `extensions/acpx/src/runtime-internals/events.ts`
    - `toAcpxErrorEvent()` now also recognizes top-level JSON-RPC `error` responses via `parseControlJsonError()`
    - `parsePromptEventLine()` now emits ACP runtime `type:"error"` events for that shape instead of dropping it
  - added regression coverage:
    - `extensions/acpx/src/runtime-internals/events.test.ts`
    - `extensions/acpx/src/runtime-internals/test-fixtures.ts`
    - `extensions/acpx/src/runtime.test.ts`
- Targeted validation passed:
  - `cd /home/openclaw/.openclaw/workspace/external/openclaw-upstream && pnpm exec vitest run extensions/acpx/src/runtime-internals/events.test.ts extensions/acpx/src/runtime.test.ts extensions/acpx/src/runtime-internals/control-errors.test.ts`
  - result: `22` tests passed across `3` files
- Net status after this pass:
  - old `acpx exited with code 1/5` reports remain historical evidence only
  - Codex ACP direct runtime path works today
  - Claude ACP direct runtime path currently fails for auth, and OpenClaw had a real bug in how the bundled acpx runtime parsed that failure shape
  - remaining follow-up is end-to-end OpenClaw ACP-path validation once ACP is explicitly configured here (or if a fresh exit-code repro appears)
- Will also explicitly requested that zap keep a light eye on active subagents and check whether they look stuck instead of assuming they are fine until completion.
- Will explicitly reinforced on 2026-03-13 that once planning is done, zap should use subagents ASAP and start implementation in a fresh session rather than continuing to implement inside the long-lived main chat.
- Will explicitly asked on 2026-03-13 for more frequent checks on active subagent runs; zap should inspect/steer sooner instead of waiting for long silent stretches.
||||
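Possible spot-check for the mismatch recorded in this daily note (transcript says `stopReason:"error"`, `runs.json` says `outcome.status: "ok"`). This is a sketch under assumptions: the transcript is treated as JSONL where each line is either a message object or wraps one under a `message` key, which is a guess at the on-disk shape rather than a documented format. `lastAssistantEntry` and `findOutcomeMismatch` are hypothetical names.

```javascript
// Hypothetical: find the last assistant message in a JSONL transcript string.
function lastAssistantEntry(jsonl) {
  const lines = jsonl.split('\n').filter((line) => line.trim() !== '');
  for (let i = lines.length - 1; i >= 0; i -= 1) {
    let entry;
    try {
      entry = JSON.parse(lines[i]);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    // Assumed shape: either {message:{role,...}} or a bare {role,...} object.
    const msg = entry?.message ?? entry;
    if (msg && typeof msg === 'object' && msg.role === 'assistant') return msg;
  }
  return undefined;
}

// Hypothetical: flag a run recorded as ok whose transcript ended in an assistant error.
function findOutcomeMismatch(jsonl, recordedOutcome) {
  const last = lastAssistantEntry(jsonl);
  const hasErrorText = typeof last?.errorMessage === 'string' && last.errorMessage.trim() !== '';
  const transcriptFailed = Boolean(last) && (last.stopReason === 'error' || hasErrorText);
  if (transcriptFailed && recordedOutcome?.status === 'ok') {
    return { mismatch: true, error: last.errorMessage ?? null };
  }
  return { mismatch: false };
}
```

In practice the two inputs would come from reading the child transcript file and the matching `runs.json` entry; the sketch takes them as plain values so it does not depend on those paths.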
+12
-4
@@ -5,11 +5,14 @@
   "title": "Fix ACP runtime failures for Claude Code and Codex agents",
   "owner": "zap",
   "priority": "high",
-  "status": "open",
-  "details": "Both ACP runs failed during this session (Claude: acpx exited with code 1, Codex: acpx exited with code 5). Investigate acpx/ACP runtime failure path and restore reliable delegation for claude/codex agents.",
+  "status": "in-progress",
+  "details": "Historical evidence said Claude/Codex ACP runs failed with `acpx exited with code 1/5`. Latest focused pass narrowed the live issue: direct bundled `acpx` now shows Codex working, while Claude returns top-level JSON-RPC `Authentication required` errors and exits 0. A focused upstream fix now makes the bundled acpx runtime surface those JSON-RPC prompt errors instead of silently treating them as success. Remaining work: validate through the real OpenClaw ACP session path once ACP is explicitly configured here, or capture a fresh repro of the older exit-code crashes.",
   "notes": [
     "Reported by Will on 2026-03-04.",
-    "Added as deferred follow-up while immediate LiteLLM route fix was applied directly."
+    "Added as deferred follow-up while immediate LiteLLM route fix was applied directly.",
+    "2026-03-13 follow-up: exact historical `acpx exited with code 1/5` crashes were not reproduced. Live direct bundled-acpx repros showed Codex success and Claude top-level JSON-RPC auth errors with clean exit 0.",
+    "2026-03-13 follow-up: fixed bundled acpx prompt parsing in external/openclaw-upstream so top-level JSON-RPC error responses now emit ACP runtime error events instead of being dropped. Targeted validation passed: 22 tests across events/control-errors/runtime test files.",
+    "2026-03-13 remaining step: validate the fix through a real OpenClaw ACP session once `acp`/`acpx` is explicitly enabled in local config, or wait for a fresh end-to-end repro of the older exit-code failures."
   ]
 },
 {
@@ -26,7 +29,12 @@
     "Implemented local TUI filtering patch in openclaw dist to suppress internal runtime completion context blocks (tui-LeOEBhMz.js).",
     "Patch timestamp: 2026-03-04T22:31:50Z",
     "Upstream patch committed in external/openclaw-upstream on branch fix/tui-hide-internal-runtime-context commit 0f66a4547 (suppress internal runtime completion context blocks in TUI formatter).",
-    "Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z"
+    "Validation: pnpm test:fast completed successfully (812 files / 6599 tests passing) at 2026-03-04T22:53:29Z",
+    "2026-03-13: confirmed corrected LiteLLM run was still failing (child transcript showed assistant 429/plan error for GLM-5) while runs.json incorrectly stored outcome.status=ok and frozenResultText=null; implemented upstream branch fix/subagent-wait-error-outcome to derive terminal subagent outcome from latest assistant error state, with targeted validation (50 tests passed across 3 files).",
+    "2026-03-13 later: live gpt-5.4 success repro passed (run 23750d80-b481-4f50-b219-cc9245be405f). Live gpt-5.4 failure repro also passed for subagent persistence/announcement handling: child run b50cb91f-6219-44f7-9d2f-a1264ac7ceaf ended with transcript stopReason=error + context_length_exceeded, and runs.json now stored outcome.status=error / endedReason=subagent-error / frozenResultText non-null. Remaining open nuance: raw agent.wait for that same failed child still returned status=ok.",
+    "2026-03-13 later: traced raw agent.wait=status:ok-on-terminal-error to an upstream bug in commands/agent.ts fallback lifecycle emission (phase:end emitted even when resolved run meta.stopReason=error). Added focused upstream fix plus dedupe-path handling/tests on branch fix/subagent-wait-error-outcome; targeted validation passed (81 tests across commands/agent.test.ts, gateway/server-methods/agent-wait-dedupe.test.ts, gateway/server-methods/server-methods.test.ts). Live verification of the new agent.wait behavior remains open.",
+    "2026-03-13 final live pass: a fresh source-run gateway on port 18902 still returned agent.wait status=ok for run gwc-live-agent-wait-gpt53-source-1773427981614 even though the same session's terminal assistant message had provider=openai-codex model=gpt-5.3-codex stopReason=error with context_length_exceeded. Most likely remaining gap: an inner lifecycle emitter still marks the live direct gateway path as phase:end early enough that waitForAgentJob resolves ok before dedupe/result-meta rescue logic can win.",
+    "2026-03-13 final focused pass: closed the remaining raw agent.wait bug. Root cause was the live direct gateway path receiving agent_end with a terminal assistant error but no preceding message_end, leaving stale assistant state and still emitting lifecycle phase:end. Final fix updated embedded subscribe lifecycle handling to recover terminal assistant errors from agent_end/session state and updated gateway server-methods/agent.ts to derive final RPC status from observed lifecycle + resolved result metadata. Validation passed (108 tests across 5 files). Live source-gateway repro on port 18903 then returned finalStatus:error and agent.wait status:error for run gwc-live-agent-wait-gpt53-source-fixed2-1773429512008."
   ]
 },
 {
|
||||
+231
@@ -0,0 +1,231 @@
|
||||
#!/usr/bin/env node
// Local hotfix: patch installed OpenClaw dist bundles so a terminal assistant
// error is recorded and announced as a failed subagent outcome.
const fs = require('node:fs');
const path = require('node:path');
const os = require('node:os');

// Locate the installed openclaw package by reading the ~/.local/bin wrapper script.
function resolveOpenClawPackageRoot() {
  const wrapperPath = path.join(os.homedir(), '.local', 'bin', 'openclaw');
  const wrapper = fs.readFileSync(wrapperPath, 'utf8');
  const match = wrapper.match(/"([^"]*node_modules\/openclaw)\/openclaw\.mjs"/);
  if (!match) throw new Error(`Could not resolve openclaw package root from ${wrapperPath}`);
  const raw = match[1];
  if (raw.startsWith('$basedir/')) {
    return path.resolve(path.dirname(wrapperPath), raw.replace(/^\$basedir\//, ''));
  }
  return raw;
}

// Replace the first occurrence of oldText with newText.
// Idempotent: if newText is already present the content is returned unchanged.
function replaceOnce(content, oldText, newText, label, filePath) {
  if (content.includes(newText)) return { content, changed: false, already: true };
  if (!content.includes(oldText)) {
    throw new Error(`Patch block not found for ${label} in ${filePath}`);
  }
  // Use a replacer function so `$` sequences in newText are inserted literally.
  return { content: content.replace(oldText, () => newText), changed: true, already: false };
}

function ensureDir(dir) {
  fs.mkdirSync(dir, { recursive: true });
}

// Patch 1: add extractAssistantTerminalText() after extractAssistantText().
const extractAssistantTextOld = `function extractAssistantText(message) {
\tif (!message || typeof message !== "object") return;
\tif (message.role !== "assistant") return;
\tconst content = message.content;
\tif (!Array.isArray(content)) return;
\tconst joined = extractTextFromChatContent(content, {
\t\tsanitizeText: sanitizeTextContent,
\t\tjoinWith: "",
\t\tnormalizeText: (text) => text.trim()
\t}) ?? "";
\tconst stopReason = message.stopReason;
\tconst errorMessage = message.errorMessage;
\tconst errorContext = stopReason === "error" || typeof errorMessage === "string" && Boolean(errorMessage.trim());
\treturn joined ? sanitizeUserFacingText(joined, { errorContext }) : void 0;
}`;

const extractAssistantTextNew = `function extractAssistantText(message) {
\tif (!message || typeof message !== "object") return;
\tif (message.role !== "assistant") return;
\tconst content = message.content;
\tif (!Array.isArray(content)) return;
\tconst joined = extractTextFromChatContent(content, {
\t\tsanitizeText: sanitizeTextContent,
\t\tjoinWith: "",
\t\tnormalizeText: (text) => text.trim()
\t}) ?? "";
\tconst stopReason = message.stopReason;
\tconst errorMessage = message.errorMessage;
\tconst errorContext = stopReason === "error" || typeof errorMessage === "string" && Boolean(errorMessage.trim());
\treturn joined ? sanitizeUserFacingText(joined, { errorContext }) : void 0;
}
function extractAssistantTerminalText(message) {
\tif (!message || typeof message !== "object") return { isError: false };
\tif (message.role !== "assistant") return { isError: false };
\tconst stopReason = message.stopReason;
\tconst rawErrorMessage = message.errorMessage;
\tconst isError = stopReason === "error" || typeof rawErrorMessage === "string" && Boolean(rawErrorMessage.trim());
\tconst text = extractAssistantText(message);
\tif (text?.trim()) return { text, isError };
\tif (typeof rawErrorMessage === "string" && rawErrorMessage.trim()) {
\t\treturn {
\t\t\ttext: sanitizeUserFacingText(rawErrorMessage.trim(), { errorContext: true }),
\t\t\tisError: true
\t\t};
\t}
\treturn { isError };
}`;

// Patch 2: readLatestAssistantReply() becomes a thin wrapper over readLatestAssistantOutcome().
const readLatestAssistantReplyOld = `async function readLatestAssistantReply(params) {
\tconst history = await callGateway({
\t\tmethod: "chat.history",
\t\tparams: {
\t\t\tsessionKey: params.sessionKey,
\t\t\tlimit: params.limit ?? 50
\t\t}
\t});
\tconst filtered = stripToolMessages(Array.isArray(history?.messages) ? history.messages : []);
\tfor (let i = filtered.length - 1; i >= 0; i -= 1) {
\t\tconst candidate = filtered[i];
\t\tif (!candidate || typeof candidate !== "object") continue;
\t\tif (candidate.role !== "assistant") continue;
\t\tconst text = extractAssistantText(candidate);
\t\tif (!text?.trim()) continue;
\t\treturn text;
\t}
}`;

const readLatestAssistantReplyNew = `async function readLatestAssistantOutcome(params) {
\tconst history = await callGateway({
\t\tmethod: "chat.history",
\t\tparams: {
\t\t\tsessionKey: params.sessionKey,
\t\t\tlimit: params.limit ?? 50
\t\t}
\t});
\tconst filtered = stripToolMessages(Array.isArray(history?.messages) ? history.messages : []);
\tfor (let i = filtered.length - 1; i >= 0; i -= 1) {
\t\tconst candidate = filtered[i];
\t\tif (!candidate || typeof candidate !== "object") continue;
\t\tif (candidate.role !== "assistant") continue;
\t\treturn extractAssistantTerminalText(candidate);
\t}
\treturn { isError: false };
}
async function readLatestAssistantReply(params) {
\tconst outcome = await readLatestAssistantOutcome(params);
\treturn outcome.text?.trim() ? outcome.text : void 0;
}`;

// Patch 3: downgrade an "ok" wait outcome to "error" when the latest assistant message is an error.
const waitOutcomeOld = `\t\tconst waitError = typeof wait.error === "string" ? wait.error : void 0;
\t\tconst outcome = wait.status === "error" ? {
\t\t\tstatus: "error",
\t\t\terror: waitError
\t\t} : wait.status === "timeout" ? { status: "timeout" } : { status: "ok" };
\t\tif (!runOutcomesEqual(entry.outcome, outcome)) {
\t\t\tentry.outcome = outcome;
\t\t\tmutated = true;
\t\t}
\t\tif (mutated) persistSubagentRuns();
\t\tawait completeSubagentRun({
\t\t\trunId,
\t\t\tendedAt: entry.endedAt,
\t\t\toutcome,
\t\t\treason: wait.status === "error" ? SUBAGENT_ENDED_REASON_ERROR : SUBAGENT_ENDED_REASON_COMPLETE,
\t\t\tsendFarewell: true,
\t\t\taccountId: entry.requesterOrigin?.accountId,
\t\t\ttriggerCleanup: true
\t\t});`;

const waitOutcomeNew = `\t\tconst waitError = typeof wait.error === "string" ? wait.error : void 0;
\t\tlet outcome = wait.status === "error" ? {
\t\t\tstatus: "error",
\t\t\terror: waitError
\t\t} : wait.status === "timeout" ? { status: "timeout" } : { status: "ok" };
\t\tif (outcome.status === "ok") try {
\t\t\tconst latestAssistant = await readLatestAssistantOutcome({
\t\t\t\tsessionKey: entry.childSessionKey,
\t\t\t\tlimit: 50
\t\t\t});
\t\t\tif (latestAssistant.isError) outcome = {
\t\t\t\tstatus: "error",
\t\t\t\terror: latestAssistant.text?.trim() || waitError
\t\t\t};
\t\t} catch {}
\t\tif (!runOutcomesEqual(entry.outcome, outcome)) {
\t\t\tentry.outcome = outcome;
\t\t\tmutated = true;
\t\t}
\t\tif (mutated) persistSubagentRuns();
\t\tawait completeSubagentRun({
\t\t\trunId,
\t\t\tendedAt: entry.endedAt,
\t\t\toutcome,
\t\t\treason: outcome.status === "error" ? SUBAGENT_ENDED_REASON_ERROR : SUBAGENT_ENDED_REASON_COMPLETE,
\t\t\tsendFarewell: true,
\t\t\taccountId: entry.requesterOrigin?.accountId,
\t\t\ttriggerCleanup: true
\t\t});`;

// Patch 4: apply the same error check in the announce flow before the "unknown" fallback.
const announceGuardOld = `\t\tif (!outcome) outcome = { status: "unknown" };`;
const announceGuardNew = `\t\tif (outcome?.status === "ok") try {
\t\t\tconst latestAssistant = await readLatestAssistantOutcome({
\t\t\t\tsessionKey: params.childSessionKey,
\t\t\t\tlimit: 50
\t\t\t});
\t\t\tif (latestAssistant.isError) {
\t\t\t\tif (!reply?.trim() && latestAssistant.text?.trim()) reply = latestAssistant.text;
\t\t\t\toutcome = {
\t\t\t\t\tstatus: "error",
\t\t\t\t\terror: latestAssistant.text?.trim() || outcome.error
\t\t\t\t};
\t\t\t}
\t\t} catch {}
\t\tif (!outcome) outcome = { status: "unknown" };`;

function main() {
  const pkgRoot = resolveOpenClawPackageRoot();
  // Hashed bundle names are specific to the installed build; missing files are skipped.
  const targets = [
    path.join(pkgRoot, 'dist', 'reply-DeXK9BLT.js'),
    path.join(pkgRoot, 'dist', 'compact-D3emcZgv.js'),
    path.join(pkgRoot, 'dist', 'pi-embedded-CrsFdYam.js'),
    path.join(pkgRoot, 'dist', 'pi-embedded-jHMb7qEG.js'),
    path.join(pkgRoot, 'dist', 'plugin-sdk', 'dispatch-CJdFmoH9.js'),
  ].filter((file) => fs.existsSync(file));

  const backupRoot = path.join(os.homedir(), '.openclaw', 'workspace', 'tmp', 'openclaw-subagent-outcome-hotfix');
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  const thisBackupDir = path.join(backupRoot, stamp);
  let touched = 0;

  for (const file of targets) {
    let content = fs.readFileSync(file, 'utf8');
    let changed = false;

    for (const [label, oldText, newText] of [
      ['extractAssistantTerminalText', extractAssistantTextOld, extractAssistantTextNew],
      ['readLatestAssistantOutcome', readLatestAssistantReplyOld, readLatestAssistantReplyNew],
      ['wait outcome downgrade', waitOutcomeOld, waitOutcomeNew],
      ['announce error guard', announceGuardOld, announceGuardNew],
    ]) {
      const result = replaceOnce(content, oldText, newText, label, file);
      content = result.content;
      changed = changed || result.changed;
    }

    if (changed) {
      // Back up the original bundle before overwriting it.
      ensureDir(thisBackupDir);
      const backupPath = path.join(thisBackupDir, path.basename(file));
      fs.copyFileSync(file, backupPath);
      fs.writeFileSync(file, content, 'utf8');
      touched += 1;
      console.log(`patched ${file}`);
    } else {
      console.log(`already patched ${file}`);
    }
  }

  console.log(`done; touched ${touched} file(s)`);
  if (touched > 0) console.log(`backup: ${thisBackupDir}`);
}

main();
||||
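Side note on the replace-once pattern this script relies on, as a standalone sketch (the helper below is a simplified stand-in written for this note, not a copy of the script's function). `String.prototype.replace` with a string replacement interprets sequences such as `$&` and `$1`, even when the pattern is a plain string, so patch text that happens to contain `$` is only inserted literally when a replacer function is used.

```javascript
// Simplified stand-in for an idempotent "replace once" patch helper.
function replaceOnceLiteral(content, oldText, newText) {
  // Already patched: leave the content alone so re-runs are no-ops.
  if (content.includes(newText)) return { content, changed: false };
  if (!content.includes(oldText)) throw new Error('patch block not found');
  // A replacer function returns newText verbatim; `$` sequences are not expanded.
  return { content: content.replace(oldText, () => newText), changed: true };
}

// With a string replacement, `$&` expands to the matched text.
const expanded = 'a X b'.replace('X', '[$&]');

// With the helper, the same replacement text is inserted as written.
const first = replaceOnceLiteral('a X b', 'X', '[$&]');
const second = replaceOnceLiteral(first.content, 'X', '[$&]');

console.log(expanded, first.content, second.changed);
```

The `includes(newText)` check is what makes a second run report "already patched"; it works here because each new block in the script is a strict superset of, or distinct from, its old block.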