Design for improving semantic search quality by transforming JSON structures into natural language at index time. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Plan: Improve RAG Personal Index JSON-to-Natural-Language Transformation
Problem
The RAG personal index produces low-quality matches for semantic queries because it indexes raw JSON structure rather than natural language.
Example failure:
- Query: "how to add a new agent"
- Expected: Match system-instructions.json→processes.agent-lifecycle.add
- Actual: Score 0.479, returns generic agent mentions instead
Root cause: The chunker doesn't recognize process structures with add/remove/rules/requirements arrays, so they fall through to raw JSON stringification.
Solution
Enhance index_personal.py to transform JSON structures into natural language at index time.
Files to Modify
~/.claude/skills/rag-search/scripts/index_personal.py - Main changes
Implementation
1. Add Process Pattern Recognition (lines ~127-138)
Add handling for process objects with action arrays:
# Process with action arrays (add, remove, rules, requirements, etc.)
action_keys = ["add", "remove", "rules", "requirements", "steps", "validate"]
if any(key in item for key in action_keys):
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(item["description"])
    for action_key in action_keys:
        if action_key in item and isinstance(item[action_key], list):
            # str() guards against non-string entries, which would break join()
            steps = ". ".join(str(step) for step in item[action_key])
            parts.append(f"To {action_key}: {steps}")
    if parts:
        yield (" ".join(parts), {**base_metadata, "process": context})
        return
    # No usable text was produced: fall through to the generic handling
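As a standalone illustration of the branch above, the following sketch wraps the same logic in a hypothetical function, process_to_text, and runs it on a made-up process object. The function name and the sample data are assumptions for illustration only; in the indexer itself this logic would sit inside process_item and yield a (text, metadata) tuple instead of returning a string.

```python
ACTION_KEYS = ["add", "remove", "rules", "requirements", "steps", "validate"]


def process_to_text(item, context=""):
    """Render a process dict with action arrays as one natural-language string.

    Hypothetical helper for illustration; returns None when the dict has no
    action key or when no text could be produced.
    """
    if not any(key in item for key in ACTION_KEYS):
        return None
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(str(item["description"]))
    for key in ACTION_KEYS:
        value = item.get(key)
        if isinstance(value, list):
            # str() keeps join() from failing on non-string entries
            parts.append(f"To {key}: " + ". ".join(str(v) for v in value))
    return " ".join(parts) if parts else None


# Made-up sample loosely modeled on processes.agent-lifecycle
sample = {
    "description": "How agents are created and retired.",
    "add": ["Create the agent file", "Register it in the index"],
    "remove": ["Delete the agent file"],
}
text = process_to_text(sample, context="agent-lifecycle")
print(text)
```

The resulting string puts the phrase "To add" next to the context name, which is the wording a query such as "how to add a new agent" is likely to resemble more closely than raw JSON does.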
2. Improve Context Propagation
When processing nested dicts, pass richer context:
# In the top-level dict processing (line ~154-161)
elif isinstance(value, dict):
    # Pass the key as context for better chunk text
    yield from process_item(value, context=key)
This propagation is already in place; the remaining work is to make sure the new action-array branch receives the context.
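One possible way to make the context richer is to carry the full dotted key path rather than only the immediate key. The sketch below is an assumption about how that could look, not the current behavior of index_personal.py; iter_with_context and the sample document are hypothetical.

```python
def iter_with_context(node, context=""):
    """Yield (context, value) for every non-dict value in a nested dict.

    The context is the dotted path of keys leading to the value, so a chunk
    built from it records where in the file the text came from.
    """
    for key, value in node.items():
        path = f"{context}.{key}" if context else key
        if isinstance(value, dict):
            # Recurse, passing the accumulated path down as context
            yield from iter_with_context(value, context=path)
        else:
            yield path, value


# Made-up document shaped like the failing example
doc = {"processes": {"agent-lifecycle": {"add": ["Create the agent file"]}}}
pairs = list(iter_with_context(doc))
print(pairs)
```

A dotted path matches the notation used in the failure example (processes.agent-lifecycle.add), at the cost of longer prefixes in every chunk.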
3. Handle Key-Value Pairs in Processes
For structures like:
"content-principles": {
  "no-redundancy": "Information lives in one authoritative location",
  "lean-files": "Keep files concise..."
}
Transform to: "content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise..."
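A minimal sketch of this transformation, assuming a hypothetical helper named kv_to_text that only handles string values. Lowercasing the first character reproduces the target text above, but it would also lowercase a leading acronym, so treat that detail as a simplification.

```python
def kv_to_text(name, mapping):
    """Render a dict of string values as '<name>: <key> means <value>. ...'.

    Hypothetical helper; non-string and empty values are skipped, and None is
    returned when nothing usable remains.
    """
    sentences = []
    for key, value in mapping.items():
        if isinstance(value, str) and value:
            # Lowercase only the first character so the sentence reads naturally
            sentences.append(f"{key} means {value[0].lower()}{value[1:]}")
    if not sentences:
        return None
    return f"{name}: " + ". ".join(sentences)


principles = {
    "no-redundancy": "Information lives in one authoritative location",
    "lean-files": "Keep files concise...",
}
result = kv_to_text("content-principles", principles)
print(result)
```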
4. Add Tests
Create a simple test to verify transformation quality:
# After reindex, verify the failing query now works
~/.claude/skills/rag-search/scripts/search.py "how to add a new agent" --index personal
# Should return system-instructions.json with score > 0.7
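The manual check above could also be expressed as a small assertion helper. The plan does not specify the output format of search.py, so this sketch assumes the results have already been parsed into a list of dicts with "file" and "score" keys; that shape, the helper name top_hit_ok, and the sample values other than 0.479 are all hypothetical.

```python
def top_hit_ok(results, expected_file, min_score=0.7):
    """Return True when the best-scoring result is the expected file
    and its score is strictly above min_score."""
    if not results:
        return False
    top = max(results, key=lambda r: r["score"])
    return top["file"] == expected_file and top["score"] > min_score


# 0.479 is the score reported in the Problem section; the file name is a placeholder
before = [{"file": "some-other-file.json", "score": 0.479}]
# Made-up post-change result, used only to exercise the helper
after = [{"file": "system-instructions.json", "score": 0.72}]

print(top_hit_ok(before, "system-instructions.json"))
print(top_hit_ok(after, "system-instructions.json"))
```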
Expected Outcome
| Query | Before | After |
|---|---|---|
| "how to add a new agent" | 0.479, wrong file | >0.7, system-instructions.json |
| "agent lifecycle" | Similar | Better match to process |
| "model selection rules" | Depends | Match model-selection process |
Validation Steps
- Run modified indexer
- Test the three queries above
- Compare scores and result relevance
Rollback
If results degrade, restore the previous indexer with git checkout scripts/index_personal.py, then rerun the indexer to rebuild the personal index.
Post-Implementation
Add to future-considerations.json:
- RAG indexer debug/verbose mode to inspect what text is being indexed
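A debug mode of this kind might amount to printing a short preview of each chunk before it is embedded. The sketch below assumes chunks are the (text, metadata) tuples yielded by the indexer and that the metadata may carry the "process" key introduced above; format_chunks and the sample chunks are hypothetical.

```python
def format_chunks(chunks, max_chars=80):
    """Return one preview line per (text, metadata) chunk.

    Long texts are truncated to max_chars characters followed by '...'.
    """
    lines = []
    for i, (text, metadata) in enumerate(chunks):
        preview = text if len(text) <= max_chars else text[:max_chars] + "..."
        source = metadata.get("process") or "-"
        lines.append(f"[{i}] {source}: {preview}")
    return lines


# Made-up chunks: one short process chunk and one long chunk without metadata
chunks = [
    ("agent-lifecycle: To add: Create the agent file", {"process": "agent-lifecycle"}),
    ("x" * 100, {}),
]
preview_lines = format_chunks(chunks)
for line in preview_lines:
    print(line)
```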
Future Considerations (Deferred)
- Natural language templates per JSON schema type
- LLM-generated summaries of complex structures
- Caching transformed text alongside original JSON