Add RAG JSON-to-text transformation plan
Design for improving semantic search quality by transforming JSON structures into natural language at index time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
110
plans/temporal-foraging-milner.md
Normal file
@@ -0,0 +1,110 @@
# Plan: Improve RAG Personal Index JSON-to-Natural-Language Transformation

## Problem

The RAG personal index produces low-quality matches for semantic queries because it indexes raw JSON structure rather than natural language.

**Example failure:**
- Query: "how to add a new agent"
- Expected: Match `system-instructions.json` → `processes.agent-lifecycle.add`
- Actual: Score 0.479, returns generic agent mentions instead

**Root cause:** The chunker doesn't recognize process structures with `add`/`remove`/`rules`/`requirements` arrays, so they fall through to raw JSON stringification.

## Solution

Enhance `index_personal.py` to transform JSON structures into natural language at index time.

## Files to Modify

1. `~/.claude/skills/rag-search/scripts/index_personal.py` - Main changes

## Implementation

### 1. Add Process Pattern Recognition (lines ~127-138)

Add handling for process objects with action arrays:

```python
# Inside process_item(): process objects with action arrays
# (add, remove, rules, requirements, etc.)
action_keys = ["add", "remove", "rules", "requirements", "steps", "validate"]
if any(key in item for key in action_keys):
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(item["description"])

    for action_key in action_keys:
        # Only list-valued keys are treated as action arrays
        if action_key in item and isinstance(item[action_key], list):
            # str() guards against non-string entries, which would break join()
            steps = ". ".join(str(step) for step in item[action_key])
            parts.append(f"To {action_key}: {steps}")

    if parts:
        yield (" ".join(parts), {**base_metadata, "process": context})
        return
```
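
To make the effect of this branch concrete, the following is a minimal standalone sketch of the same idea, runnable outside the indexer. The function name `process_to_text` and the sample process are hypothetical (the real contents of `system-instructions.json` are not shown in this plan), and the sketch ends each action sentence with a period, which the fragment above does not do.

```python
import json

ACTION_KEYS = ["add", "remove", "rules", "requirements", "steps", "validate"]


def process_to_text(item, context=""):
    """Return natural-language text for a process dict, or None if it
    has no list-valued action keys (hypothetical helper, not indexer code)."""
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(item["description"])

    found_action = False
    for key in ACTION_KEYS:
        value = item.get(key)
        if isinstance(value, list):
            found_action = True
            steps = ". ".join(str(step) for step in value)
            parts.append(f"To {key}: {steps}.")

    return " ".join(parts) if found_action else None


# Hypothetical sample data, invented for illustration only.
sample = {
    "description": "Manage agents.",
    "add": ["Create the agent file", "Register it in the index"],
    "remove": ["Delete the agent file"],
}

raw_text = json.dumps(sample)  # what the fall-through path would index
nl_text = process_to_text(sample, context="agent-lifecycle")

print(raw_text)
print(nl_text)
```

Printing both strings side by side shows the difference the plan is after: the raw form is dominated by braces, quotes, and key names, while the transformed form reads like the kind of sentence a query such as "how to add a new agent" is likely to resemble.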

### 2. Improve Context Propagation

When processing nested dicts, pass richer context:

```python
# In the top-level dict processing (line ~154-161)
elif isinstance(value, dict):
    # Pass the key as context for better chunk text
    yield from process_item(value, context=key)
```

The indexer already does this; the remaining work is to make sure the action-array handling from section 1 receives that context.
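
One possible reading of "richer context" is to carry the full key path rather than only the immediate key, so a chunk is labelled `processes.agent-lifecycle` instead of just `agent-lifecycle`. The sketch below illustrates that option under stated assumptions: `iter_contexts` and the sample document are hypothetical, and the plan itself only commits to passing the key.

```python
def iter_contexts(node, path=()):
    """Yield (dotted_path, dict) for every dict nested under `node`.

    Hypothetical helper: shows one way to build a path-style context.
    """
    for key, value in node.items():
        if isinstance(value, dict):
            child_path = path + (key,)
            yield ".".join(child_path), value
            # Recurse so deeper dicts inherit the accumulated path
            yield from iter_contexts(value, child_path)


# Hypothetical document shape, invented for illustration only.
doc = {
    "processes": {
        "agent-lifecycle": {"add": ["step one"], "remove": ["step two"]},
    },
    "version": 1,
}

contexts = [context for context, _ in iter_contexts(doc)]
print(contexts)
```

Whether the longer path helps or hurts retrieval is an open question; a path prefix adds disambiguating terms but also dilutes short chunks, so it is worth comparing both forms during validation.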

### 3. Handle Key-Value Pairs in Processes

For structures like:
```json
"content-principles": {
  "no-redundancy": "Information lives in one authoritative location",
  "lean-files": "Keep files concise..."
}
```

Transform to: `"content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise..."`
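
A minimal sketch of this transformation follows, assuming a flat dict whose values are short sentences. The function name `key_values_to_text` is hypothetical, and the rule for lowercasing the first letter (only when the second character is lowercase, so acronyms survive) is an assumption rather than something the plan specifies.

```python
def key_values_to_text(name, mapping):
    """Render a flat {key: sentence} dict as '<name>: key means sentence. ...'.

    Hypothetical helper; non-string and empty values are skipped because
    they would need different handling.
    """
    sentences = []
    for key, value in mapping.items():
        if not isinstance(value, str):
            continue
        body = value.strip().rstrip(".")
        if not body:
            continue
        # Lowercase the leading capital of an ordinary word, but leave
        # acronyms such as "JSON" untouched.
        if len(body) > 1 and body[1].islower():
            body = body[0].lower() + body[1:]
        sentences.append(f"{key} means {body}.")
    return f"{name}: " + " ".join(sentences)


principles = {
    "no-redundancy": "Information lives in one authoritative location",
    # The plan truncates this value; only the visible part is used here.
    "lean-files": "Keep files concise",
}

text = key_values_to_text("content-principles", principles)
print(text)
```

The "means" connective is taken from the target string above; other connectives (for example a colon) may embed just as well and could be compared during validation.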

### 4. Add Tests

Create a simple test to verify transformation quality:

```bash
# After reindex, verify the failing query now works
~/.claude/skills/rag-search/scripts/search.py "how to add a new agent" --index personal
# Should return system-instructions.json with score > 0.7
```

## Expected Outcome

| Query | Before | After |
|-------|--------|-------|
| "how to add a new agent" | 0.479, wrong file | >0.7, system-instructions.json |
| "agent lifecycle" | Similar | Better match to process |
| "model selection rules" | Depends | Match model-selection process |

## Validation Steps

1. Run modified indexer
2. Test the three queries above
3. Compare scores and result relevance
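
The comparison in the last step could be scripted along the following lines. This is only a sketch: the output format of `search.py` is not described in this plan, so the `Hit` record and the `search` callable are assumptions, and the two stub functions return canned values for demonstration, not measurements.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hit:
    source: str   # file the matching chunk came from (assumed field)
    score: float  # similarity score (assumed field)


def check_query(search: Callable[[str], List[Hit]], query: str,
                expected_source: str, min_score: float = 0.7) -> bool:
    """Return True if the top hit comes from expected_source with score > min_score."""
    hits = search(query)
    if not hits:
        return False
    top = max(hits, key=lambda hit: hit.score)
    return top.source.endswith(expected_source) and top.score > min_score


# Stand-in search functions with canned results (hypothetical values).
def search_before(query: str) -> List[Hit]:
    return [Hit("other-file.json", 0.479)]


def search_after(query: str) -> List[Hit]:
    return [Hit("system-instructions.json", 0.75), Hit("other-file.json", 0.40)]


query = "how to add a new agent"
print(check_query(search_before, query, "system-instructions.json"))
print(check_query(search_after, query, "system-instructions.json"))
```

Only the first query in the table has a concrete target (file and threshold), so it is the only one this check can assert on; the other two still need a manual look at result relevance.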

## Rollback

If results degrade, revert the indexer with `git checkout scripts/index_personal.py`, then reindex.

## Post-Implementation

Add to `future-considerations.json`:
- RAG indexer debug/verbose mode to inspect what text is being indexed

## Future Considerations (Deferred)

- Natural language templates per JSON schema type
- LLM-generated summaries of complex structures
- Caching transformed text alongside original JSON