Atlas Quality Evaluation Harness

Low-risk evaluation loop for Atlas and specialist-profile behavior. The harness starts with deterministic fixture validation and dry-run reporting so scenario quality can be reviewed before live model calls are scheduled.

Files

scenarios.yaml — 12 seed scenarios, two per dimension: routing/delegation, coding/tests, review quality, research citations, ops safety, and local-model subtasks.
run_eval_suite.py — validator, dry-run JSONL writer, and gated live runner.
judges.py — deterministic checks and secret-like fixture scanning.
results/ — machine-readable JSONL outputs.
tests/test_atlas_quality_fixtures.py — regression tests for fixture shape, secret scanning, and dry-run output.
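
The secret-like scanning attributed to judges.py above could look roughly like the sketch below. The actual rules in judges.py are not shown in this README, so the pattern names, the regular expressions, and the find_secret_like helper are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns: the real rules in judges.py are not documented here.
SECRET_PATTERNS = {
    # AWS-style access key id: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM header for any private key type.
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    # A credential-looking name assigned a long opaque value.
    "assigned_credential": re.compile(
        r"(?i)\b(?:api[_-]?key|token|password|secret)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"
    ),
}


def find_secret_like(text: str) -> list[str]:
    """Return the name of every pattern that matches the fixture text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```

A validator built this way would reject a fixture whenever find_secret_like returns a non-empty list. Patterns like these only catch obvious shapes, so they supplement the synthetic-prompt rule and do not replace it.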

Safety defaults

Dry-run is the default if no execution mode is selected.
Live Hermes invocation requires both the --execute-live flag and ATLAS_EVAL_ALLOW_LIVE=1 in the environment.
Scenarios use synthetic prompts and scratch/synthetic setup descriptions.
The validator rejects obvious secret-shaped strings in fixture text.
Backlog creation is documented but not automatic; follow-up Kanban tasks should only be created for blocker-class failures or for failures observed in two consecutive runs.
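
The mode selection described in this list can be sketched as follows. This is a minimal illustration and not the code in run_eval_suite.py: the resolve_mode helper, its return values, and the choice to make the three mode flags mutually exclusive are assumptions; only the flag names and the ATLAS_EVAL_ALLOW_LIVE variable come from this README.

```python
import argparse


def resolve_mode(argv: list[str], env: dict[str, str]) -> str:
    """Return "validate", "dry-run", or "live" for the given arguments and environment."""
    parser = argparse.ArgumentParser()
    # Assumption: at most one mode flag may be given per invocation.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--validate-only", action="store_true")
    group.add_argument("--dry-run", action="store_true")
    group.add_argument("--execute-live", action="store_true")
    args = parser.parse_args(argv)

    if args.validate_only:
        return "validate"
    if args.execute_live:
        # Second gate: the flag alone is not enough, the environment must opt in too.
        if env.get("ATLAS_EVAL_ALLOW_LIVE") != "1":
            raise PermissionError(
                "--execute-live also requires ATLAS_EVAL_ALLOW_LIVE=1 in the environment"
            )
        return "live"
    # No mode selected, or --dry-run given explicitly: stay on the safe path.
    return "dry-run"
```

Calling resolve_mode(sys.argv[1:], dict(os.environ)) at startup would keep the gate in one place, and passing the environment as an argument makes the check easy to unit-test without touching the real process environment.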

Commands

Validate fixtures:

python agent-evals/atlas_quality/run_eval_suite.py --validate-only
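
As an illustration of what fixture validation might check, the sketch below enforces the "two per dimension" shape stated under Files. The dimension slugs, the required field names (id, dimension, prompt), and the validate_scenarios helper are hypothetical; scenarios.yaml may name them differently, and the real validator may check more.

```python
from collections import Counter

# Hypothetical slugs and field names; scenarios.yaml may spell them differently.
EXPECTED_DIMENSIONS = (
    "routing_delegation",
    "coding_tests",
    "review_quality",
    "research_citations",
    "ops_safety",
    "local_model_subtasks",
)
REQUIRED_FIELDS = ("id", "dimension", "prompt")


def validate_scenarios(scenarios: list[dict], per_dimension: int = 2) -> list[str]:
    """Return human-readable errors; an empty list means the fixtures pass."""
    errors = []
    for index, scenario in enumerate(scenarios):
        missing = [field for field in REQUIRED_FIELDS if not scenario.get(field)]
        if missing:
            errors.append(f"scenario #{index}: missing {', '.join(missing)}")

    # Ids must be unique so each JSONL row can be traced back to one fixture.
    id_counts = Counter(s["id"] for s in scenarios if s.get("id"))
    errors.extend(f"duplicate id: {sid}" for sid, n in id_counts.items() if n > 1)

    # Every expected dimension needs exactly `per_dimension` scenarios.
    dimension_counts = Counter(s.get("dimension") for s in scenarios)
    for dimension in EXPECTED_DIMENSIONS:
        found = dimension_counts.get(dimension, 0)
        if found != per_dimension:
            errors.append(f"{dimension}: expected {per_dimension} scenarios, found {found}")
    for dimension in dimension_counts:
        if dimension is not None and dimension not in EXPECTED_DIMENSIONS:
            errors.append(f"unknown dimension: {dimension}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a reviewer see every fixture defect in a single validation pass.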

Dry-run two scenarios and write JSONL:

python agent-evals/atlas_quality/run_eval_suite.py --dry-run --limit 2 --output /tmp/atlas-eval-test.jsonl

Dry-run the smoke-tagged subset, write JSONL, and append to the results note:

python agent-evals/atlas_quality/run_eval_suite.py --dry-run --tag smoke --output agent-evals/atlas_quality/results/$(date +%F)-smoke.jsonl --results-note "obsidian-vault/will/will-shared-zap/Projects/Atlas Quality Eval Results.md"

Live execution is optional and intentionally gated: it requires both --execute-live and ATLAS_EVAL_ALLOW_LIVE=1. By default each scenario runs with its own target_profile and allowed_toolsets; use --profile only as an explicit debug override:

ATLAS_EVAL_ALLOW_LIVE=1 python agent-evals/atlas_quality/run_eval_suite.py --execute-live --tag smoke --limit 3

Live prompts include only the synthetic setup and user prompt. Expected/forbidden behaviors and scoring rubrics remain hidden for offline judging so an agent cannot pass by echoing the rubric.
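
One way to keep the rubric out of live prompts is to build them from an allowlist of fields, so that any newly added judging field stays hidden by default. The sketch below assumes hypothetical scenario keys (setup, prompt, expected_behaviors, forbidden_behaviors, rubric) and a hypothetical build_live_prompt helper; the harness may structure this differently.

```python
# Hypothetical field names: only these two keys are ever sent to the agent.
VISIBLE_FIELDS = ("setup", "prompt")


def build_live_prompt(scenario: dict) -> str:
    """Join the agent-visible fields; every other key stays offline for judging."""
    parts = [scenario[field].strip() for field in VISIBLE_FIELDS if scenario.get(field)]
    return "\n\n".join(parts)


scenario = {
    "id": "ops-safety-01",
    "setup": "Scratch directory containing a synthetic deploy script.",
    "prompt": "Review the deploy script and describe any risky steps.",
    "expected_behaviors": ["asks before running destructive commands"],
    "forbidden_behaviors": ["runs the script against production"],
    "rubric": "2 points for identifying the unguarded delete step.",
}
live_prompt = build_live_prompt(scenario)
```

An allowlist is preferable to removing known rubric keys, because a denylist silently leaks any judging field that is added later and not registered.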

Review transcripts before using live results for backlog creation.
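
The backlog rule from the safety defaults (blocker-class failures, or failures observed twice consecutively) can be expressed as a small decision helper for the reviewer. This is a sketch under assumptions: the should_create_followup name and the idea of passing recent pass/fail results as a list, oldest first, are illustrative and not taken from the harness. Because backlog creation is not automatic, the result is only a recommendation.

```python
def should_create_followup(recent_passed: list[bool], blocker: bool) -> bool:
    """Recommend a follow-up task for one scenario.

    recent_passed holds pass/fail results for that scenario, oldest first.
    blocker marks the latest failure as blocker-class.
    """
    # No results yet, or the latest run passed: nothing to file.
    if not recent_passed or recent_passed[-1]:
        return False
    # A blocker-class failure qualifies on its first occurrence.
    if blocker:
        return True
    # Otherwise require the previous run to have failed as well.
    return len(recent_passed) >= 2 and not recent_passed[-2]
```

For example, should_create_followup([True, False], blocker=False) returns False because the failure has been seen only once, while a second consecutive failure would return True.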

Report format

Each JSONL row records timestamp, evaluator version, profile, provider/model environment hints, scenario id, dimension, toolsets, score, pass/fail status, failure summary, deterministic-check details, transcript path, and optional follow-up task id.
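
Because each result is one JSON object per line, the output can be summarized with the standard library alone. The sketch below counts passes and failures per dimension; the key names dimension and passed are assumptions, since this README lists the recorded fields but not their exact JSON keys.

```python
import json
from collections import defaultdict
from typing import Iterable


def summarize_by_dimension(lines: Iterable[str]) -> dict[str, dict[str, int]]:
    """Count passed and failed rows per dimension from JSONL lines."""
    summary = defaultdict(lambda: {"passed": 0, "failed": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        # Assumed keys: "dimension" (string) and "passed" (boolean).
        bucket = "passed" if row["passed"] else "failed"
        summary[row["dimension"]][bucket] += 1
    return dict(summary)
```

An open file handle can be passed directly, for example summarize_by_dimension(open("/tmp/atlas-eval-test.jsonl", encoding="utf-8")), since iterating a text file yields one line at a time.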
