test(atlas): add quality evaluation fixtures
@@ -0,0 +1,53 @@
# Atlas Quality Evaluation Harness
Low-risk evaluation loop for Atlas and specialist-profile behavior. The harness starts with deterministic fixture validation and dry-run reporting so scenario quality can be reviewed before live model calls are scheduled.
## Files
- `scenarios.yaml` — 12 seed scenarios, two per dimension: routing/delegation, coding/tests, review quality, research citations, ops safety, and local-model subtasks.
- `run_eval_suite.py` — validator, dry-run JSONL writer, and gated live runner.
- `judges.py` — deterministic checks and secret-like fixture scanning.
- `results/` — machine-readable JSONL outputs.
- `tests/test_atlas_quality_fixtures.py` — regression tests for fixture shape, secret scanning, and dry-run output.
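
The "two per dimension" shape of the seed set lends itself to a deterministic check. The following is a minimal sketch, assuming each scenario is a dict with a `dimension` key; the dimension labels below are hypothetical spellings of the six dimensions listed above, and the real identifiers in `scenarios.yaml` may differ.

```python
from collections import Counter

# Hypothetical labels for the six dimensions; adjust to match scenarios.yaml.
EXPECTED_DIMENSIONS = (
    "routing_delegation",
    "coding_tests",
    "review_quality",
    "research_citations",
    "ops_safety",
    "local_model_subtasks",
)


def check_dimension_coverage(scenarios, per_dimension=2):
    """Return a list of problems; an empty list means the shape is as expected."""
    counts = Counter(s.get("dimension") for s in scenarios)
    problems = []
    # Every expected dimension must appear exactly `per_dimension` times.
    for dim in EXPECTED_DIMENSIONS:
        found = counts.get(dim, 0)
        if found != per_dimension:
            problems.append(f"{dim}: expected {per_dimension}, found {found}")
    # Anything outside the expected set is reported rather than silently kept.
    for dim in counts:
        if dim not in EXPECTED_DIMENSIONS:
            problems.append(f"unknown dimension: {dim!r}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a validator report every shape error in a single pass.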
## Safety defaults
- Dry-run is the default if no execution mode is selected.
- Live Hermes invocation requires `--execute-live` and `ATLAS_EVAL_ALLOW_LIVE=1`.
- Scenarios use synthetic prompts and scratch/synthetic setup descriptions.
- The validator rejects obvious secret-shaped strings in fixture text.
- Backlog creation is documented but not automatic; follow-up Kanban tasks should only be created for blocker-class failures or failures observed twice consecutively.
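
The first two defaults amount to a double gate: a flag plus an environment variable. One way this could be wired is sketched below. The flag names and `ATLAS_EVAL_ALLOW_LIVE` come from this README; the function name `resolve_mode` and the choice to make the three flags mutually exclusive are assumptions, not a description of how `run_eval_suite.py` actually parses its arguments.

```python
import argparse


def resolve_mode(argv, environ):
    """Return 'validate', 'dry-run', or 'live'; refuse live unless both gates are open."""
    parser = argparse.ArgumentParser()
    # Assumption: the three modes are mutually exclusive.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--validate-only", action="store_true")
    group.add_argument("--dry-run", action="store_true")
    group.add_argument("--execute-live", action="store_true")
    args = parser.parse_args(argv)

    if args.validate_only:
        return "validate"
    if args.execute_live:
        # The flag alone is not enough; the environment must opt in as well.
        if environ.get("ATLAS_EVAL_ALLOW_LIVE") != "1":
            raise SystemExit("--execute-live also requires ATLAS_EVAL_ALLOW_LIVE=1")
        return "live"
    # No mode selected, or --dry-run given explicitly: both fall through to dry-run.
    return "dry-run"
```

Passing `argv` and `environ` in explicitly (for example `resolve_mode(sys.argv[1:], os.environ)`) keeps the gate testable without touching the real process environment.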
## Commands
Validate fixtures:
```bash
python agent-evals/atlas_quality/run_eval_suite.py --validate-only
```
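
Part of validation is the secret-shaped string scan. A rough sketch of such a scan follows; the pattern set and the name `find_secret_like` are illustrative assumptions, and the rules in `judges.py` may be broader or stricter.

```python
import re

# Illustrative patterns only; not the actual rule set used by judges.py.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "key_value_assignment": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"
    ),
}


def find_secret_like(text):
    """Return the names of all patterns that match somewhere in `text`."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```

A scan like this only catches obvious shapes. It is a guard against accidental pastes into fixtures, not a substitute for keeping real credentials out of the repository.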
Dry-run two scenarios and write JSONL:
```bash
python agent-evals/atlas_quality/run_eval_suite.py --dry-run --limit 2 --output /tmp/atlas-eval-test.jsonl
```
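
For orientation, a dry-run writer of this kind could be as small as the sketch below. It assumes scenarios are dicts with `id` and `dimension` keys, writes only a subset of the report fields, and overwrites the output file; the `"dry-run"` status label and the function name are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_dry_run_rows(scenarios, output_path, limit=None):
    """Write one placeholder row per selected scenario and return the row count."""
    selected = scenarios if limit is None else scenarios[:limit]
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for scenario in selected:
            row = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "scenario_id": scenario["id"],
                "dimension": scenario["dimension"],
                "status": "dry-run",  # assumed label for rows with no model call
                "score": None,  # nothing was judged, so there is no score
            }
            # JSONL: exactly one JSON object per line.
            handle.write(json.dumps(row) + "\n")
    return len(selected)
```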
Run the smoke subset as dry-run data and append the results note:
```bash
python agent-evals/atlas_quality/run_eval_suite.py --dry-run --tag smoke --output agent-evals/atlas_quality/results/$(date +%F)-smoke.jsonl --results-note "obsidian-vault/will/will-shared-zap/Projects/Atlas Quality Eval Results.md"
```
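
The `--tag` and `--limit` selection and the dated file name in the command above can be expressed in Python as follows. This is a sketch under assumptions: a per-scenario `tags` list is a hypothetical field, and applying the tag filter before the limit is a guess at the intended order.

```python
from datetime import date
from pathlib import Path


def select_scenarios(scenarios, tag=None, limit=None):
    """Keep scenarios carrying `tag` (if given), then truncate to `limit` (if given)."""
    selected = [s for s in scenarios if tag is None or tag in s.get("tags", [])]
    return selected if limit is None else selected[:limit]


def dated_output_path(results_dir, label, today=None):
    """Build <results_dir>/<YYYY-MM-DD>-<label>.jsonl, matching the shell's $(date +%F)."""
    today = today or date.today()
    return Path(results_dir) / f"{today.isoformat()}-{label}.jsonl"
```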
Optional live execution is intentionally gated. By default each scenario runs with its own `target_profile` and `allowed_toolsets`; use `--profile` only as an explicit debug override:
```bash
ATLAS_EVAL_ALLOW_LIVE=1 python agent-evals/atlas_quality/run_eval_suite.py --execute-live --tag smoke --limit 3
```
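
The per-scenario default and the `--profile` override described above reduce to a small resolution step. In this sketch `target_profile` and `allowed_toolsets` are the scenario fields named in this README, while the function name, the return shape, and the assumption that an override leaves the toolsets unchanged are illustrative.

```python
def resolve_run_config(scenario, profile_override=None):
    """Return the profile and toolsets a live run would use for one scenario."""
    overridden = profile_override is not None
    return {
        # The scenario's own profile wins unless a debug override is given.
        "profile": profile_override if overridden else scenario["target_profile"],
        # Assumption: an override changes the profile only, never the toolsets.
        "toolsets": list(scenario["allowed_toolsets"]),
        # Recorded so overridden runs can be told apart when reading reports.
        "profile_overridden": overridden,
    }
```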
Live prompts include only the synthetic setup and user prompt. Expected/forbidden behaviors and scoring rubrics remain hidden for offline judging so an agent cannot pass by echoing the rubric.

Review transcripts before using live results for backlog creation.
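
The rule that rubrics stay hidden from the agent is easiest to enforce with an allow-list when the prompt is assembled. A minimal sketch, assuming hypothetical field names `setup` and `user_prompt`:

```python
# Assumed field names; only these are ever sent to the agent under test.
PROMPT_FIELDS = ("setup", "user_prompt")


def build_live_prompt(scenario):
    """Join the allow-listed fields; rubric and expectation fields are never read."""
    parts = [scenario[field].strip() for field in PROMPT_FIELDS if scenario.get(field)]
    return "\n\n".join(parts)
```

An allow-list is preferable to deleting known rubric keys, because a judging field added to the fixtures later is excluded from the prompt by default.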
## Report format
Each JSONL row records timestamp, evaluator version, profile, provider/model environment hints, scenario id, dimension, toolsets, score, pass/fail status, failure summary, deterministic-check details, transcript path, and optional follow-up task id.
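
Rows in this shape are enough to evaluate the follow-up rule from the safety defaults (blocker-class failures, or the same failure twice in a row). The sketch below only recommends a follow-up and never creates a task. The key names `scenario_id`, `status`, and `severity`, and the values `"fail"` and `"blocker"`, are assumptions about how those fields are serialized; rows are assumed to appear oldest first.

```python
import json
from collections import defaultdict


def group_by_scenario(jsonl_lines):
    """Group parsed rows by scenario id, preserving the order they were written in."""
    grouped = defaultdict(list)
    for line in jsonl_lines:
        if line.strip():  # tolerate blank lines between rows
            row = json.loads(line)
            grouped[row["scenario_id"]].append(row)
    return grouped


def needs_follow_up(history):
    """Decide whether one scenario's rows (oldest first) justify a follow-up task."""
    if not history or history[-1]["status"] != "fail":
        return False
    # Hypothetical `severity` field marking blocker-class failures.
    if history[-1].get("severity") == "blocker":
        return True
    # Otherwise require the previous run of the same scenario to have failed too.
    return len(history) >= 2 and history[-2]["status"] == "fail"
```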