Add RAG JSON-to-text transformation plan

Design for improving semantic search quality by transforming JSON structures into natural language at index time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 11:11:34 -08:00
parent c14c0d843d
commit c82726b691
1 changed files with 110 additions and 0 deletions
--- a/plans/temporal-foraging-milner.md
+++ b/plans/temporal-foraging-milner.md
@@ -0,0 +1,110 @@
 # Plan: Improve RAG Personal Index JSON-to-Natural-Language Transformation
 ## Problem
 The RAG personal index produces low-quality matches for semantic queries because it indexes raw JSON structure rather than natural language.
 **Example failure:**
 - Query: "how to add a new agent"
 - Expected: Match `system-instructions.json` → `processes.agent-lifecycle.add`
 - Actual: Score 0.479, returns generic agent mentions instead
 **Root cause:** The chunker doesn't recognize process structures with `add`/`remove`/`rules`/`requirements` arrays, so they fall through to raw JSON stringification.
 ## Solution
 Enhance `index_personal.py` to transform JSON structures into natural language at index time.
 ## Files to Modify
 1. `~/.claude/skills/rag-search/scripts/index_personal.py` - Main changes
 ## Implementation
 ### 1. Add Process Pattern Recognition (lines ~127-138)
 Add handling for process objects with action arrays:
 ```python
 # Process with action arrays (add, remove, rules, requirements, etc.)
 action_keys = ["add", "remove", "rules", "requirements", "steps", "validate"]
 if any(key in item for key in action_keys):
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(item["description"])
    for action_key in action_keys:
        if action_key in item and isinstance(item[action_key], list):
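            # str() guards against non-string entries, which would make join() raise TypeError
            action_text = f"To {action_key}: " + ". ".join(str(entry) for entry in item[action_key])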
            parts.append(action_text)
    if parts:
        yield (" ".join(parts), {**base_metadata, "process": context})
        return
 ```
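To make the intended output concrete, here is a minimal self-contained sketch of the same logic wrapped in a standalone generator. The function name `process_to_text`, the sample object, and the `source` metadata key are assumptions made for illustration; the real `index_personal.py` internals and the real contents of `system-instructions.json` are not shown in this plan.

```python
from typing import Any, Iterator

ACTION_KEYS = ["add", "remove", "rules", "requirements", "steps", "validate"]


def process_to_text(
    item: dict[str, Any], context: str, base_metadata: dict[str, Any]
) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield one (text, metadata) chunk for a process object, or nothing."""
    if not any(key in item for key in ACTION_KEYS):
        return  # not a process object; the caller falls through to other handlers
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(str(item["description"]))
    for key in ACTION_KEYS:
        values = item.get(key)
        if isinstance(values, list) and values:
            parts.append(f"To {key}: " + ". ".join(str(v) for v in values))
    if parts:
        yield " ".join(parts), {**base_metadata, "process": context}


# Hypothetical sample data, invented for this example.
sample = {
    "description": "Managing agents",
    "add": ["Create the agent file", "Register it in the index"],
}
chunks = list(process_to_text(sample, "agent-lifecycle", {"source": "example.json"}))
print(chunks[0][0])
# agent-lifecycle: Managing agents To add: Create the agent file. Register it in the index
```

Note that noun-like keys such as `rules` produce text like "To rules: ...", so a per-key phrasing table may be worth considering if that wording hurts match quality.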
 ### 2. Improve Context Propagation
 When processing nested dicts, pass richer context:
 ```python
 # In the top-level dict processing (lines ~154-161)
 elif isinstance(value, dict):
    # Pass the key as context for better chunk text
    yield from process_item(value, context=key)
 ```
 The indexer already passes the key as context here; the remaining work is to make sure the new action-array branch receives and uses that context.
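One possible reading of "richer context" is to accumulate the full dotted path (for example `processes.agent-lifecycle.add`) rather than only the immediate key. The sketch below illustrates that idea only; `walk_with_context` is a hypothetical helper, not the existing `process_item` function, and the sample document is invented.

```python
from typing import Any, Iterator


def walk_with_context(value: Any, context: str = "") -> Iterator[tuple[str, Any]]:
    """Yield (dotted_context, value) pairs for every non-dict value in nested JSON."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Extend the path so deeper chunks keep their ancestors' names.
            child_context = f"{context}.{key}" if context else str(key)
            yield from walk_with_context(child, child_context)
    else:
        yield context, value


# Hypothetical document shape, assumed for illustration.
doc = {"processes": {"agent-lifecycle": {"add": ["step one"], "remove": ["step two"]}}}
paths = [path for path, _ in walk_with_context(doc)]
print(paths)
# ['processes.agent-lifecycle.add', 'processes.agent-lifecycle.remove']
```

Whether the full path or just the nearest key gives better embeddings is an open question that the validation queries can help answer.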
 ### 3. Handle Key-Value Pairs in Processes
 For structures like:
 ```json
 "content-principles": {
  "no-redundancy": "Information lives in one authoritative location",
  "lean-files": "Keep files concise..."
 }
 ```
 Transform to: `"content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise..."`
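A minimal sketch of this transformation, assuming a flat dict of string values, could look like the following. The helper name `describe_key_values` is hypothetical, and the second sample value is shortened for the example rather than taken from the real file.

```python
def describe_key_values(name: str, mapping: dict) -> str:
    """Render a flat {key: sentence} dict as '<name>: <key> means <sentence>. ...'."""
    clauses = []
    for key, value in mapping.items():
        if not isinstance(value, str) or not value:
            continue  # nested or empty values are left to other handlers
        text = value.rstrip(".")
        # Lowercase the first letter only for ordinary words, so acronyms like "JSON" survive.
        if len(text) > 1 and text[1].islower():
            text = text[0].lower() + text[1:]
        clauses.append(f"{key} means {text}")
    if not clauses:
        return ""
    return f"{name}: " + ". ".join(clauses) + "."


principles = {
    "no-redundancy": "Information lives in one authoritative location",
    "lean-files": "Keep files concise",  # illustrative value, shortened for the example
}
print(describe_key_values("content-principles", principles))
# content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise.
```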
 ### 4. Add Tests
 Create a simple test to verify transformation quality:
 ```bash
 # After reindex, verify the failing query now works
 ~/.claude/skills/rag-search/scripts/search.py "how to add a new agent" --index personal
 # Should return system-instructions.json with score > 0.7
 ```
 ## Expected Outcome
 | Query | Before | After |
 |-------|--------|-------|
 | "how to add a new agent" | 0.479, wrong file | >0.7, system-instructions.json |
 | "agent lifecycle" | Similar | Better match to process |
 | "model selection rules" | Depends | Match model-selection process |
 ## Validation Steps
 1. Run modified indexer
 2. Test the three queries above
 3. Compare scores and result relevance
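The comparison in step 3 could be made repeatable with a small check like the one below. It is a sketch under assumptions: the output format of `search.py` is not specified in this plan, so results are modeled as plain `(file, score)` tuples, and every value other than the 0.479 score and the 0.7 threshold (including the `agents.json` file name) is a placeholder, not a measurement.

```python
from typing import NamedTuple


class Expectation(NamedTuple):
    expected_file: str
    min_score: float


def top_hit_ok(results: list[tuple[str, float]], expectation: Expectation) -> bool:
    """Return True if the best-scoring result is the expected file above the threshold."""
    if not results:
        return False
    best_file, best_score = max(results, key=lambda hit: hit[1])
    return best_file == expectation.expected_file and best_score > expectation.min_score


expectation = Expectation("system-instructions.json", 0.7)

# Placeholder result lists for illustration only, not measured output.
before = [("agents.json", 0.479)]
after = [("system-instructions.json", 0.74), ("agents.json", 0.5)]

print(top_hit_ok(before, expectation), top_hit_ok(after, expectation))
# False True
```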
 ## Rollback
 If results degrade, revert the indexer and rebuild the index: `git checkout scripts/index_personal.py && reindex` (here `reindex` is shorthand for rerunning the indexer).
 ## Post-Implementation
 Add to `future-considerations.json`:
 - RAG indexer debug/verbose mode to inspect what text is being indexed
 ## Future Considerations (Deferred)
 - Natural language templates per JSON schema type
 - LLM-generated summaries of complex structures
 - Caching transformed text alongside original JSON