From c82726b691ebcd6324601e139df282d27d2353e2 Mon Sep 17 00:00:00 2001
From: OpenCode Test
Date: Wed, 7 Jan 2026 11:11:34 -0800
Subject: [PATCH] Add RAG JSON-to-text transformation plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Design for improving semantic search quality by transforming JSON structures into natural language at index time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5
---
 plans/temporal-foraging-milner.md | 110 ++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 plans/temporal-foraging-milner.md

diff --git a/plans/temporal-foraging-milner.md b/plans/temporal-foraging-milner.md
new file mode 100644
index 0000000..d22eac0
--- /dev/null
+++ b/plans/temporal-foraging-milner.md
@@ -0,0 +1,110 @@
+# Plan: Improve RAG Personal Index JSON-to-Natural-Language Transformation
+
+## Problem
+
+The RAG personal index produces low-quality matches for semantic queries because it indexes raw JSON structure rather than natural language.
+
+**Example failure:**
+- Query: "how to add a new agent"
+- Expected: Match `system-instructions.json` → `processes.agent-lifecycle.add`
+- Actual: Score 0.479, returns generic agent mentions instead
+
+**Root cause:** The chunker doesn't recognize process structures with `add`/`remove`/`rules`/`requirements` arrays, so they fall through to raw JSON stringification.
+
+## Solution
+
+Enhance `index_personal.py` to transform JSON structures into natural language at index time.
+
+## Files to Modify
+
+1. `~/.claude/skills/rag-search/scripts/index_personal.py` - Main changes
+
+## Implementation
+
+### 1. Add Process Pattern Recognition (lines ~127-138)
+
+Add handling for process objects with action arrays:
+
+```python
+# Process with action arrays (add, remove, rules, requirements, etc.)
+action_keys = ["add", "remove", "rules", "requirements", "steps", "validate"]
+if any(key in item for key in action_keys):
+    parts = []
+    if context:
+        parts.append(f"{context}:")
+    if item.get("description"):
+        parts.append(item["description"])
+
+    for action_key in action_keys:
+        if action_key in item and isinstance(item[action_key], list):
+            action_text = f"To {action_key}: " + ". ".join(str(entry) for entry in item[action_key])
+            parts.append(action_text)
+
+    if parts:
+        yield (" ".join(parts), {**base_metadata, "process": context})
+        return
+```
+
+### 2. Improve Context Propagation
+
+When processing nested dicts, pass richer context:
+
+```python
+# In the top-level dict processing (lines ~154-161)
+elif isinstance(value, dict):
+    # Pass the key as context for better chunk text
+    yield from process_item(value, context=key)
+```
+
+This propagation is already in place; ensure the action-array handling from step 1 receives the same context.
+
+### 3. Handle Key-Value Pairs in Processes
+
+For structures like:
+```json
+"content-principles": {
+  "no-redundancy": "Information lives in one authoritative location",
+  "lean-files": "Keep files concise..."
+}
+```
+
+Transform to: `"content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise..."`
+
+### 4. Add Tests
+
+Create a simple test to verify transformation quality:
+
+```bash
+# After reindex, verify the failing query now works
+~/.claude/skills/rag-search/scripts/search.py "how to add a new agent" --index personal
+# Should return system-instructions.json with score > 0.7
+```
+
+## Expected Outcome
+
+| Query | Before | After |
+|-------|--------|-------|
+| "how to add a new agent" | 0.479, wrong file | >0.7, system-instructions.json |
+| "agent lifecycle" | Similar | Better match to process |
+| "model selection rules" | Depends | Match model-selection process |
+
+## Validation Steps
+
+1. Run modified indexer
+2. Test the three queries above
+3. Compare scores and result relevance
+
+## Rollback
+
+If results degrade: `git checkout scripts/index_personal.py && reindex`
+
+## Post-Implementation
+
+Add to `future-considerations.json`:
+- RAG indexer debug/verbose mode to inspect what text is being indexed
+
+## Future Considerations (Deferred)
+
+- Natural language templates per JSON schema type
+- LLM-generated summaries of complex structures
+- Caching transformed text alongside original JSON
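
Step 3 of the plan (key-value pairs in processes) gives a target string but no code. The following is a minimal sketch of that transformation, not the indexer's actual implementation: the helper name `flatten_key_value_block`, its signature, and the choice to return `None` for anything other than a flat dict of strings are all assumptions made for illustration.

```python
from typing import Optional


def flatten_key_value_block(name: str, block: dict) -> Optional[str]:
    """Render a flat dict of string values as "<name>: key means value. ..." text.

    Hypothetical helper, not part of index_personal.py. Returns None when the
    block is empty or holds non-string values, so a caller could fall back to
    whatever handling it already has.
    """
    if not block or not all(isinstance(value, str) for value in block.values()):
        return None

    clauses = []
    for key, value in block.items():
        # Drop surrounding whitespace and trailing periods; one is re-added below.
        text = value.strip().rstrip(".")
        if not text:
            continue
        # Lower-case a sentence-style first letter, but leave acronyms like "JSON" alone.
        if len(text) > 1 and text[1].islower():
            text = text[0].lower() + text[1:]
        clauses.append(f"{key} means {text}")

    if not clauses:
        return None
    return f"{name}: " + ". ".join(clauses) + "."


if __name__ == "__main__":
    sample = {
        "no-redundancy": "Information lives in one authoritative location",
        "lean-files": "Keep files concise...",
    }
    print(flatten_key_value_block("content-principles", sample))
```

Returning `None` rather than raising keeps the sketch easy to slot in ahead of an existing fallback path; whether the real chunker should behave that way is a design decision the plan leaves open.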