
OpenCode Test c82726b691 Add RAG JSON-to-text transformation plan

Design for improving semantic search quality by transforming JSON
structures into natural language at index time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-07 11:11:34 -08:00

3.3 KiB


Plan: Improve RAG Personal Index JSON-to-Natural-Language Transformation

Problem

The RAG personal index produces low-quality matches for semantic queries because it indexes raw JSON structure rather than natural language.

Example failure:

Query: "how to add a new agent"
Expected: Match system-instructions.json → processes.agent-lifecycle.add
Actual: Score 0.479, returns generic agent mentions instead

Root cause: The chunker doesn't recognize process structures with add/remove/rules/requirements arrays, so they fall through to raw JSON stringification.

Solution

Enhance index_personal.py to transform JSON structures into natural language at index time.

Files to Modify

~/.claude/skills/rag-search/scripts/index_personal.py - Main changes

Implementation

1. Add Process Pattern Recognition (lines ~127-138)

Add handling for process objects with action arrays:

# Process with action arrays (add, remove, rules, requirements, etc.)
action_keys = ["add", "remove", "rules", "requirements", "steps", "validate"]
# Only treat the item as a process when at least one action key holds a
# non-empty list; otherwise fall through to the existing handlers.
actions = {k: item[k] for k in action_keys if isinstance(item.get(k), list) and item[k]}
if actions:
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(item["description"])

    for action_key, entries in actions.items():
        # str() guards against non-string entries (numbers, nested objects)
        action_text = f"To {action_key}: " + ". ".join(str(e) for e in entries)
        parts.append(action_text)

    yield (" ".join(parts), {**base_metadata, "process": context})
    return
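
The branch above can be tried in isolation before touching the indexer. The following is a minimal sketch, not the indexer's actual code: the function name process_to_text and the sample data are hypothetical, and it returns a string rather than yielding a (text, metadata) tuple so it can run standalone.

```python
from typing import Any, Dict, Optional

ACTION_KEYS = ["add", "remove", "rules", "requirements", "steps", "validate"]


def process_to_text(item: Dict[str, Any], context: str = "") -> Optional[str]:
    """Return chunk text for a process dict, or None when no action array is present.

    Hypothetical standalone helper; the real indexer yields from a generator.
    """
    parts = []
    if context:
        parts.append(f"{context}:")
    if item.get("description"):
        parts.append(str(item["description"]))

    found_action = False
    for key in ACTION_KEYS:
        entries = item.get(key)
        # Only non-empty lists count as action arrays.
        if isinstance(entries, list) and entries:
            found_action = True
            parts.append(f"To {key}: " + ". ".join(str(e) for e in entries))

    return " ".join(parts) if found_action else None


# Hypothetical sample, shaped like the process structures described above.
sample = {
    "description": "How agents are created and retired",
    "add": ["Create the agent file", "Register it in the index"],
    "remove": ["Delete the agent file"],
}
text = process_to_text(sample, context="agent-lifecycle")
print(text)
```

Returning None when no action array is found lets the caller fall through to whatever handling it already has, instead of emitting a chunk that contains only the context label.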

2. Improve Context Propagation

When processing nested dicts, pass richer context:

# In the top-level dict processing (line ~154-161)
elif isinstance(value, dict):
    # Pass the key as context for better chunk text
    yield from process_item(value, context=key)

This propagation is already in place; the remaining work is to make sure the action-array branch from step 1 also receives the context.
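
One possible reading of "richer context" is to carry the full key path (for example processes.agent-lifecycle.add) rather than only the immediate key. The sketch below illustrates that idea under that assumption; walk is a hypothetical helper, not a function in index_personal.py, and the sample document is made up.

```python
from typing import Any, Iterator, Tuple


def walk(value: Any, context: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, leaf_value) pairs, extending the context at each nested dict."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Join parent and child keys so each leaf knows its full ancestry.
            path = f"{context}.{key}" if context else key
            yield from walk(child, context=path)
    else:
        # Lists and scalars are treated as leaves in this sketch.
        yield context, value


# Hypothetical document shaped like the expected match in the Problem section.
doc = {"processes": {"agent-lifecycle": {"add": ["Create the agent file"]}}}
paths = list(walk(doc))
print(paths)
```

Whether the dotted path or just the nearest key produces better embeddings is an open question that the validation queries would need to answer.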

3. Handle Key-Value Pairs in Processes

For structures like:

"content-principles": {
  "no-redundancy": "Information lives in one authoritative location",
  "lean-files": "Keep files concise..."
}

Transform to: "content-principles: no-redundancy means information lives in one authoritative location. lean-files means keep files concise..."
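
A minimal sketch of this transformation, assuming the values are plain strings. The name key_values_to_text is hypothetical, and the second sample entry is invented for illustration (the source elides the real lean-files text).

```python
from typing import Any, Dict


def key_values_to_text(name: str, mapping: Dict[str, Any]) -> str:
    """Render a dict of short string definitions as '<name>: <key> means <value>. ...'."""
    sentences = []
    for key, value in mapping.items():
        # Skip anything that is not a non-empty string; nested values need other handling.
        if not isinstance(value, str) or not value:
            continue
        # Lowercase the first letter only when the second is lowercase,
        # so acronyms such as "JSON" are left intact.
        if len(value) > 1 and value[1].islower():
            phrase = value[0].lower() + value[1:]
        else:
            phrase = value
        # Strip trailing periods so each sentence ends with exactly one.
        sentences.append(f"{key} means {phrase.rstrip('.')}.")
    return f"{name}: " + " ".join(sentences)


principles = {
    "no-redundancy": "Information lives in one authoritative location",
    "single-owner": "Each file has one owner.",  # hypothetical entry
}
chunk = key_values_to_text("content-principles", principles)
print(chunk)
```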

4. Add Tests

Create a simple test to verify transformation quality:

# After reindex, verify the failing query now works
~/.claude/skills/rag-search/scripts/search.py "how to add a new agent" --index personal
# Should return system-instructions.json with score > 0.7
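
The manual check above could later be turned into an automated assertion. The sketch below assumes search results can be obtained as (file path, score) pairs; that shape, the helper name top_hit_ok, and every value in the samples other than the 0.479 score are hypothetical, since the plan does not specify the output format of search.py.

```python
from typing import List, Tuple


def top_hit_ok(
    results: List[Tuple[str, float]],
    expected_file: str,
    min_score: float = 0.7,
) -> bool:
    """True when the best-scoring result is the expected file and its score exceeds min_score."""
    if not results:
        return False
    best_file, best_score = max(results, key=lambda r: r[1])
    return best_file.endswith(expected_file) and best_score > min_score


# Placeholder results for illustration only, not measured output.
before = [("notes/agents.md", 0.479)]
after = [("system-instructions.json", 0.74), ("notes/agents.md", 0.48)]

print(top_hit_ok(before, "system-instructions.json"))
print(top_hit_ok(after, "system-instructions.json"))
```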

Expected Outcome

Query	Before	After
"how to add a new agent"	0.479, wrong file	>0.7, system-instructions.json
"agent lifecycle"	Similar	Better match to process
"model selection rules"	Depends	Match model-selection process

Validation Steps

Run modified indexer
Test the three queries above
Compare scores and result relevance

Rollback

If results degrade, restore the previous indexer with git checkout scripts/index_personal.py, then rerun the indexer to rebuild the personal index.

Post-Implementation

Add to future-considerations.json:

RAG indexer debug/verbose mode to inspect what text is being indexed

Future Considerations (Deferred)

Natural language templates per JSON schema type
LLM-generated summaries of complex structures
Caching transformed text alongside original JSON
