# Plan: Improve RAG Personal Index JSON-to-Natural-Language Transformation

## Problem

The RAG personal index produces low-quality matches for semantic queries because it indexes raw JSON structure rather than natural language.

**Example failure:**

- Query: "how to add a new agent"
- Expected: Match `system-instructions.json` → `processes.agent-lifecycle.add`
- Actual: Score 0.479; returns generic agent mentions instead

**Root cause:** The chunker doesn't recognize process structures with `add`/`remove`/`rules`/`requirements` arrays, so they fall through to raw JSON stringification.

## Solution

Enhance `index_personal.py` to transform JSON structures into natural language at index time.

## Files to Modify

1. `~/.claude/skills/rag-search/scripts/index_personal.py` - Main changes

## Implementation

### 1. Add Process Pattern Recognition (lines ~127-138)

Add handling for process objects with action arrays:

```python
# Process with action arrays (add, remove, rules, requirements, etc.)
action_keys = ["add", "remove", "rules", "requirements", "steps", "validate"]
if any(key in item for key in action_keys):
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(item["description"])
    for action_key in action_keys:
        if action_key in item and isinstance(item[action_key], list):
            # Coerce entries to str so non-string list items do not break join()
            steps = ". ".join(str(step) for step in item[action_key])
            parts.append(f"To {action_key}: {steps}")
    if parts:
        yield (" ".join(parts), {**base_metadata, "process": context})
        return
```

### 2. Improve Context Propagation

When processing nested dicts, pass richer context:

```python
# In the top-level dict processing (line ~154-161)
elif isinstance(value, dict):
    # Pass the key as context for better chunk text
    yield from process_item(value, context=key)
```

This propagation is already in place. The remaining work is to confirm that the action-array branch from step 1 receives the context.

### 3. Handle Key-Value Pairs in Processes

For structures like:

```json
"content-principles": {
  "no-redundancy": "Information lives in one authoritative location",
  "lean-files": "Keep files concise..."
}
```

Transform to:

`"content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise..."`

### 4. Add Tests

Create a simple test to verify transformation quality:

```bash
# After reindex, verify the failing query now works
~/.claude/skills/rag-search/scripts/search.py "how to add a new agent" --index personal
# Should return system-instructions.json with score > 0.7
```

## Expected Outcome

| Query | Before | After |
|-------|--------|-------|
| "how to add a new agent" | 0.479, wrong file | >0.7, system-instructions.json |
| "agent lifecycle" | Similar | Better match to process |
| "model selection rules" | Depends | Match model-selection process |

## Validation Steps

1. Run the modified indexer
2. Test the three queries above
3. Compare scores and result relevance

## Rollback

If results degrade: `git checkout scripts/index_personal.py && reindex`

## Post-Implementation

Add to `future-considerations.json`:

- RAG indexer debug/verbose mode to inspect what text is being indexed

## Future Considerations (Deferred)

- Natural language templates per JSON schema type
- LLM-generated summaries of complex structures
- Caching transformed text alongside original JSON
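
Implementation step 3 (key-value pairs in processes) describes the target text but gives no code. A minimal sketch of that transformation might look like the following. The helper name `describe_key_values` and its signature are hypothetical and do not come from `index_personal.py`; it is assumed that only flat dicts of string values should be rewritten, and that anything else falls back to the indexer's existing handling.

```python
from __future__ import annotations

from typing import Any


def describe_key_values(name: str, mapping: dict[str, Any]) -> str | None:
    """Render a flat dict of string values as natural-language text.

    Hypothetical helper (not part of the existing indexer). Returns None
    when the mapping is empty or holds non-string values, so the caller
    can fall back to whatever handling it already has.
    """
    if not mapping or not all(isinstance(v, str) for v in mapping.values()):
        return None

    clauses = []
    for key, value in mapping.items():
        # Drop surrounding whitespace and a trailing period; the join below
        # supplies the sentence separators.
        text = value.strip().rstrip(".")
        # Lowercase only the first character so acronyms later in the text survive.
        text = text[:1].lower() + text[1:]
        clauses.append(f"{key} means {text}")

    return f"{name}: " + ". ".join(clauses) + "."


if __name__ == "__main__":
    principles = {
        "no-redundancy": "Information lives in one authoritative location",
        "lean-files": "Keep files concise",
    }
    print(describe_key_values("content-principles", principles))
```

Returning `None` rather than raising keeps the helper easy to slot in ahead of a raw-JSON fallback: the caller only yields the transformed chunk when the structure actually matches the flat key-value shape.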