feat(npu): add document image triage prototype
@@ -0,0 +1,164 @@
# OpenVINO NPU document/image triage prototype

Local-only, CLI-first prototype for triaging screenshots, photos/scans, and PDF page images.
It returns structured JSON metadata and explicitly reports CPU vs NPU stages.
Optional HTTP is a loopback-only prototype that listens on `127.0.0.1:18829` only when explicitly started. Non-loopback binds are rejected, and it is not a live Atlas/Hermes/RAG integration.

Location: `/home/will/lab/swarm/openvino-doc-image-triage-npu/`

## Privacy and safety

- No external uploads.
- The only network call is an optional request to the localhost-only embeddings service at `127.0.0.1:18817`.
- Raw OCR/sidecar text is redacted by default and is not logged.
- Full source paths are omitted by default; responses include basename and SHA-256.
- Allowed roots are enforced for CLI/server requests.
- This prototype does not mutate Obsidian, RAG, Chroma, vector collections, routing, or gateway services.
- Do not process broad private document/image directories; use generated synthetic fixtures unless Will explicitly approves a narrow source root.
- See `SPEC.md` for the full CLI contract, smoke-test plan, NPU verification plan, docs implications, and no-go/defer criteria.
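
The allowed-root rule in the list above can be illustrated with a minimal sketch. The helper name `resolve_under_allowed_roots` and the choice of `PermissionError` are assumptions made for illustration; `triage.py` may enforce the rule differently. It relies on `Path.is_relative_to`, which needs Python 3.9 or newer.

```python
from __future__ import annotations

from pathlib import Path


def resolve_under_allowed_roots(raw_path: str, allowed_roots: list[Path]) -> Path:
    """Return the resolved path, or raise if it lies outside every allowed root.

    Hypothetical helper for illustration only.
    """
    candidate = Path(raw_path).expanduser().resolve()
    for root in allowed_roots:
        resolved_root = root.expanduser().resolve()
        # resolve() collapses ".." and follows symlinks, so the check is made
        # against where the path really points, not how it was spelled.
        if candidate == resolved_root or candidate.is_relative_to(resolved_root):
            return candidate
    raise PermissionError("path is outside the allowed roots")
```

Resolving both sides before comparing is the design choice that matters here: a request such as `samples/../../private.png` is judged by its final location rather than by its textual prefix.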

## CPU vs NPU stages

CPU:
- file intake, allowed-root checks, size checks, hashing
- image/PDF decoding/rendering and normalization
- optional local text extraction from sidecars or PDF text libraries
- regex metadata extraction and rule-based category fallback
- final needs-attention rules
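
As a rough illustration of the regex metadata stage, the sketch below counts dates and amounts without returning the matched values, which mirrors the `dates_count`, `amounts_count`, and `raw_values_redacted` fields in the example output. Both patterns and the function name are assumptions; the prototype's actual regexes are not shown in this document.

```python
from __future__ import annotations

import re

# Illustrative patterns only (assumptions, not the prototype's real rules).
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")


def extract_metadata_counts(text: str) -> dict[str, object]:
    """Count dates and currency amounts; never return the raw matches."""
    return {
        "dates_count": len(ISO_DATE.findall(text)),
        "amounts_count": len(AMOUNT.findall(text)),
        "raw_values_redacted": True,
    }


sample = "\n".join([
    "ACME Utilities Invoice",
    "Invoice No: INV-2026-0604",
    "Amount Due: $123.45",
    "Payment due 2026-06-30",
])
counts = extract_metadata_counts(sample)
```

Returning counts instead of values keeps sensitive strings out of the JSON by construction, so redaction does not depend on a later filtering step.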

NPU:
- needs-attention semantic embedding, via existing local OpenVINO embeddings service on `:18817`
- verified with `/sys/class/accel/accel0/device/npu_busy_time_us` before/after each embedding call

Not configured in v1:
- image category classifier on NPU. The JSON reports this as `CPU rule fallback (NPU model not configured in prototype v1)`. A future task can add a static-shape MobileNet/EfficientNet/ResNet OpenVINO IR model.
- OCR on NPU. OCR remains CPU/local plumbing in v1.

## Files

- `triage.py`: core library and CLI.
- `server.py`: stdlib HTTP server with `/healthz`, `/models`, `/triage`, `/triage/batch`.
- `make_samples.py`: creates synthetic non-private image/PDF samples.
- `tests/smoke_test.py`: end-to-end smoke test, including NPU busy-time verification when `:18817` is reachable.
- `samples/`: generated synthetic fixtures.

## Requirements

Use the existing NPU venv when available:

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python -m pip install pillow
```

`pillow` is already present in the existing `/home/will/.venvs/npu` venv. Optional local PDF text extraction and rendering libraries improve PDF support:

```bash
/home/will/.venvs/npu/bin/python -m pip install pypdf pypdfium2
```

The smoke test needs no external services. Positive NPU verification additionally requires the existing localhost `:18817` embeddings service.

## CLI usage

Generate synthetic samples:

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python make_samples.py
```

Triage local files:

```bash
/home/will/.venvs/npu/bin/python triage.py \
  --allowed-root /home/will/lab/swarm/openvino-doc-image-triage-npu \
  --pretty \
  samples/synthetic_invoice.png samples/synthetic_invoice.pdf
```

Disable the local NPU embeddings call if needed:

```bash
/home/will/.venvs/npu/bin/python triage.py --no-embeddings --allowed-root "$PWD" samples/synthetic_receipt.png
```

Include OCR/sidecar text in a single response only when explicitly requested:

```bash
/home/will/.venvs/npu/bin/python triage.py --include-ocr-text --allowed-root "$PWD" samples/synthetic_invoice.png
```

## HTTP usage

The prototype is CLI-first. HTTP is optional and not enabled by default. If a foreground HTTP server is needed for review, prefer optional port `18829` so it does not collide with the GenAI worker prototype on `18820`. Check the port first:

```bash
ss -ltnp | grep ':18829\b' || true
```

Start a local-only server and stop it after the smoke test:

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
```

Call it with synthetic/non-private fixtures only:

```bash
curl -sS http://127.0.0.1:18829/healthz | jq
curl -sS http://127.0.0.1:18829/models | jq
curl -sS -X POST http://127.0.0.1:18829/triage \
  -H 'Content-Type: application/json' \
  -d '{"path":"/home/will/lab/swarm/openvino-doc-image-triage-npu/samples/synthetic_invoice.png","options":{"allowed_roots":["/home/will/lab/swarm/openvino-doc-image-triage-npu"]}}' | jq
```

Do not install or enable a persistent service for this prototype without explicit approval, and do not point it at private document/image directories during smoke tests.

## Smoke test

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python tests/smoke_test.py
```

Expected: JSON ending with `"ok": true`. The smoke test generates only synthetic fixtures, verifies that non-loopback HTTP binds are rejected, starts its temporary server on a preflighted free localhost port, and terminates it before exit. If the embeddings service is up, the result should show a positive NPU busy-time delta, and each embedded page should report `verified_npu: true`.

## Example output shape

```json
{
  "file_id": "sha256:...",
  "source_path_basename": "synthetic_invoice.png",
  "media_type": "image",
  "page_count": 1,
  "pages": [
    {
      "page_index": 0,
      "classification": {
        "label": "bill_or_invoice",
        "confidence": 0.71,
        "device": "CPU",
        "method": "rule_based_fallback"
      },
      "needs_attention": {
        "value": true,
        "device": "NPU+CPU",
        "reasons": ["amount_due", "due_date_present"],
        "embedding": {"verified_npu": true, "npu_busy_delta_us": 12345}
      },
      "metadata": {"dates_count": 1, "amounts_count": 1, "raw_values_redacted": true},
      "ocr": {"available": true, "device": "CPU"}
    }
  ],
  "processing_device_summary": {
    "file_intake": "CPU",
    "image_category_classification": "CPU rule fallback (NPU model not configured in prototype v1)",
    "needs_attention_embedding": "NPU via local :18817",
    "metadata_extraction": "CPU",
    "npu_verified": true
  },
  "privacy": {"external_uploads": false, "raw_text_logged": false}
}
```
@@ -0,0 +1,146 @@
# OpenVINO NPU document/image triage spec

Status: CLI-first prototype specification; not a live Atlas/Hermes integration.

## Safety stance

- Default workflow is local CLI execution against explicitly named files.
- Optional HTTP is disabled unless a human starts it, is constrained to loopback (`127.0.0.1`, `::1`, or `localhost`), and is intended for `127.0.0.1:18829` only.
- No persistent systemd unit, Docker service, gateway hook, Atlas/Hermes route, RAG route, Chroma/vector collection mutation, or in-place reindexing is part of this spec.
- Smoke data must be synthetic/non-private only. Do not point this tool at Will's private document, image, screenshot, Downloads, Desktop, Obsidian, or photo-library directories without explicit approval.
- NPU claims require `/sys/class/accel/accel0/device/npu_busy_time_us` before/after deltas. HTTP 200, JSON output, or model-load success alone is not NPU proof.

## Recommended model/runtime

Recommended v1 runtime:

- File intake, hashing, MIME/extension checks, image/PDF rendering, sidecar/native PDF text extraction, metadata extraction, and category fallback: local Python CPU path using Pillow plus optional `pypdf`/`pypdfium2`.
- Needs-attention semantic check: reuse the live localhost OpenVINO embeddings service on `127.0.0.1:18817`, currently `bge-base-en-v1.5-int8-ov`, and verify each embedding call with `npu_busy_time_us` deltas.
- Category classification in v1: CPU rule fallback, explicitly reported as not an NPU image model.

Why this is the recommended v1:

- It avoids private-data exposure: no external upload path and no broader local file scanning.
- It avoids collection/routing risk by using the existing embeddings API as a stateless feature extractor only; it does not write to RAG or Chroma.
- It gives a real NPU verification hook for the semantic stage without overclaiming that OCR/image classification are NPU-backed.
- It keeps the prototype useful even when optional PDF dependencies or the embeddings service are unavailable: it can fall back to CPU-only metadata/rule output and mark NPU verification false.

Deferred model work:

- NPU image category classifier: defer until a static-shape OpenVINO IR image model such as MobileNet/EfficientNet/ResNet is selected, calibrated for the label set, and smoke-tested with busy-time deltas.
- NPU OCR/VLM: defer; OCR remains local CPU text plumbing in v1.

## CLI contract

Command:

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python triage.py \
  --allowed-root /home/will/lab/swarm/openvino-doc-image-triage-npu \
  --max-pages 3 \
  --pretty \
  samples/synthetic_invoice.png samples/synthetic_invoice.pdf
```

Inputs:

- Positional `paths`: one or more local image/PDF paths.
- `--allowed-root ROOT`: may repeat; every requested path must resolve under one of these roots. Default is current directory.
- `--max-pages N`: maximum rendered/extracted PDF pages; default 3.
- `--no-embeddings`: disables the localhost `:18817` embedding/NPU check and reports CPU fallback/no text.
- `--dry-run`: skip image/PDF rendering while still checking intake/hash/text/metadata where available.
- `--include-ocr-text`: include raw extracted/sidecar text in this single response only; off by default.
- `--include-full-path`: include resolved full paths; off by default.
- `--pretty`: pretty-print JSON.

Output:

- Batch JSON: `{ "ok": bool, "files": [...], "generated_at": "..." }`.
- Per file result includes `file_id` as `sha256:<digest>`, `source_path_basename`, media type, file size, pages, classification, needs-attention result, metadata counts/flags, privacy flags, and processing-device summary.
- Raw OCR/text and full paths are omitted unless explicitly requested.
- NPU evidence is per embedding call: `used`, `verified_npu`, `npu_busy_delta_us`, endpoint, and wall time.
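
A minimal sketch of how the `file_id` and path fields described above could be derived. The helper names are hypothetical, and the `source_path` key for the opt-in full path is an assumption inferred from the smoke test, which asserts that key is absent by default.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_id_for(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return `sha256:<digest>`, hashing in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"sha256:{digest.hexdigest()}"


def public_identity(path: Path, include_full_path: bool = False) -> dict[str, str]:
    """Build the identity fields; the full path is opt-in only."""
    identity = {
        "file_id": file_id_for(path),
        "source_path_basename": path.name,
    }
    if include_full_path:
        # Assumed key name; only emitted when explicitly requested.
        identity["source_path"] = str(path.resolve())
    return identity
```

Identifying files by content hash plus basename lets two runs be correlated without exposing the directory layout.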

Exit behavior:

- Exit 0 when all files triage successfully.
- Exit 2 when one or more files fail policy/intake/processing checks.
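
The exit rule can be sketched as a small mapping from the batch JSON to a process exit code. This is an illustration only: the function name is hypothetical, and the per-file `ok` flag is an assumption, since the spec only states that the batch object carries `ok`.

```python
from __future__ import annotations

from typing import Any

EXIT_OK = 0
EXIT_FAILED = 2


def exit_code_for(batch: dict[str, Any]) -> int:
    """Return 0 only when the batch and every file entry report success."""
    files = batch.get("files") or []
    # Assumption: each file entry has its own boolean "ok" flag.
    all_ok = bool(batch.get("ok")) and all(entry.get("ok") for entry in files)
    return EXIT_OK if all_ok else EXIT_FAILED
```

A caller would typically end with `raise SystemExit(exit_code_for(batch))`, so shell scripts can branch on the result without parsing the JSON.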

## Optional localhost HTTP contract

HTTP is optional and not enabled by this spec. If explicitly started for a smoke test or local demo, use localhost and port 18829:

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
ss -ltnp | grep ':18829\b' || true
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
```

Endpoints:

- `GET /healthz` (also `/health` or `/`): service name, bind policy, configured allowed roots, privacy flags, and current `npu_busy_time_us`.
- `GET /models`: reports v1 stages and whether each is CPU or NPU-backed.
- `POST /triage`: `{ "path": "/local/file", "options": {...} }` -> `{ "ok": true, "result": ... }`.
- `POST /triage/batch`: `{ "paths": ["/local/file"], "options": {...} }` -> batch JSON.
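
For callers that prefer Python over `curl`, a stdlib-only client for `POST /triage` might look like the sketch below. `BASE_URL` uses the optional prototype port from this spec; the function names are hypothetical, and `send` only works while a server has been started by hand.

```python
from __future__ import annotations

import json
import urllib.request

BASE_URL = "http://127.0.0.1:18829"  # optional prototype port; loopback only


def build_triage_request(path: str, allowed_roots: list[str] | None = None) -> urllib.request.Request:
    """Build a POST /triage request without sending it."""
    options: dict[str, object] = {}
    if allowed_roots:
        # May only narrow the server's startup allowlist, never widen it.
        options["allowed_roots"] = allowed_roots
    body = json.dumps({"path": path, "options": options}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/triage",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode())
```

Separating request construction from sending makes the payload shape easy to inspect in a test without a running listener.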

HTTP privacy/policy rules:

- Server startup `--allowed-root` is the outer allowlist.
- Request `options.allowed_roots` may narrow that allowlist but must not widen it.
- Request `options.embedding_url` may only target the configured local loopback embeddings route `http://127.0.0.1:18817/v1/embeddings` (or localhost equivalent); external or alternate endpoints are rejected.
- Request bodies and raw text are not logged by the stdlib handler.
- Stop the temporary server after the smoke test or demo.

## Synthetic smoke-test plan

Use only generated fixtures under the prototype directory:

```bash
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python make_samples.py
/home/will/.venvs/npu/bin/python tests/smoke_test.py
```

Expected smoke coverage:

- Creates synthetic invoice/receipt/form-like image/PDF fixtures.
- Runs CLI triage against the synthetic invoice image/PDF under an explicit allowed root.
- Asserts privacy flags (`external_uploads: false`, no full path by default).
- Asserts invoice category/needs-attention behavior on synthetic text.
- Starts a temporary localhost HTTP server on a preflighted free ephemeral port, calls `/healthz` and `/triage`, verifies no full path leakage, rejects attempts to widen allowed roots, rejects external embedding URLs, and verifies non-loopback binds are rejected.
- Terminates the temporary server.

The smoke port in tests should stay OS-assigned ephemeral/non-live to avoid claiming `18829` as a persistent service.

## NPU busy-time verification plan

For every test that claims NPU use:

1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before the operation.
2. Perform an operation that should call the live embeddings service on `127.0.0.1:18817` with non-empty synthetic text.
3. Read `npu_busy_time_us` after the operation.
4. Require both:
   - the per-result embedding object reports `used: true`, `verified_npu: true`, and `npu_busy_delta_us > 0`; and
   - the outer before/after sysfs value increased.
5. If sysfs is missing or `:18817` is unavailable, do not claim NPU success; report CPU fallback / embedding unavailable and keep the smoke result honest.
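
The read/operate/read steps above can be sketched as a small wrapper. The function names are hypothetical and this is not the `read_npu_busy` helper from `triage.py`; the sysfs path is the one named in this spec.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Optional

BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = BUSY_PATH) -> Optional[int]:
    """Return the busy-time counter, or None when sysfs is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def measure_npu_delta(operation: Callable[[], object], path: Path = BUSY_PATH) -> dict[str, object]:
    """Run `operation` and report whether the busy counter increased."""
    before = read_busy_us(path)
    operation()
    after = read_busy_us(path)
    if before is None or after is None:
        # Missing counter: never claim NPU success.
        return {"verified_npu": False, "npu_busy_delta_us": None}
    delta = after - before
    return {"verified_npu": delta > 0, "npu_busy_delta_us": delta}
```

A positive delta is treated as necessary evidence rather than conclusive proof: if the counter is shared by other NPU workloads on the host, concurrent activity could also raise it, so smoke runs are most trustworthy on an otherwise idle device.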

## Docs and diagram implications

- Service maps should list document/image triage as CLI-first and optional prototype `127.0.0.1:18829`, not live unless explicitly started.
- Diagrams must not draw live Atlas/Hermes/gateway/RAG routing to this triage lane.
- If shown with other candidate sidecars, label it separately from live services: live baseline remains RAG `:18810`, Whisper NPU `:18816`, and embeddings `:18817`; prototype sidecars are reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, and optional doc/image triage `:18829`.
- Runbooks should include CLI smoke, localhost listener checks, busy-time delta verification, and server shutdown instructions.
- Documentation should state CPU vs NPU stages explicitly so the prototype does not imply NPU OCR or NPU image classification.

## No-go / defer criteria

Do not proceed to implementation, live integration, or persistent service enablement if any of these are true:

- Will has not explicitly approved live routing or persistent service enablement.
- The requested source path is a private document/image directory or broad home-directory scan rather than synthetic fixtures or an explicitly approved narrow root.
- The workflow would mutate Obsidian, RAG, Chroma/vector collections, or reindex in place.
- The optional server would need to bind anywhere other than localhost.
- NPU busy-time does not increase for an operation being described as NPU-backed.
- Raw OCR text or full paths would be logged, uploaded, stored durably, or returned without explicit request.
- PDF/image dependencies are missing and the task requires rendered page analysis rather than metadata/text-only fallback.
- A future image classifier/OCR/VLM model has not been selected, converted/quantized to OpenVINO, calibrated for the task, and verified on synthetic fixtures with busy-time deltas.
@@ -0,0 +1,69 @@
#!/usr/bin/env python3
from __future__ import annotations

from pathlib import Path

from PIL import Image, ImageDraw, ImageFilter

ROOT = Path(__file__).resolve().parent
SAMPLES = ROOT / "samples"


def make_doc(path: Path, lines: list[str], size=(900, 1200), rotate: int = 0, blur: bool = False) -> None:
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    y = 70
    for line in lines:
        draw.text((70, y), line, fill="black")
        y += 55
    draw.rectangle((55, 50, size[0] - 55, min(size[1] - 50, y + 30)), outline="gray", width=3)
    if blur:
        img = img.filter(ImageFilter.GaussianBlur(2.5))
    if rotate:
        img = img.rotate(rotate, expand=True, fillcolor="white")
    img.save(path)
    path.with_suffix(path.suffix + ".txt").write_text("\n".join(lines) + "\n")


def main() -> int:
    SAMPLES.mkdir(exist_ok=True)
    make_doc(SAMPLES / "synthetic_invoice.png", [
        "ACME Utilities Invoice",
        "Invoice No: INV-2026-0604",
        "Amount Due: $123.45",
        "Payment due 2026-06-30",
        "Please submit payment by the due date.",
    ])
    make_doc(SAMPLES / "synthetic_receipt.png", [
        "Neighborhood Store Receipt",
        "Subtotal $14.20",
        "Tax $1.42",
        "Total $15.62",
        "Thank you for shopping",
    ], size=(720, 1100), rotate=3)
    make_doc(SAMPLES / "synthetic_conversation.png", [
        "Messages with Alex",
        "Can you please respond by tomorrow?",
        "Need signature on the form before Friday.",
    ], size=(1200, 750))
    make_doc(SAMPLES / "synthetic_sensitive_form.png", [
        "Sample Government Form - Fake Data",
        "Applicant: Test Person",
        "SSN: 123-45-6789",
        "Signature required",
        "Submit by Jan 15, 2027",
    ], blur=False)
    make_doc(SAMPLES / "synthetic_blurry.png", [
        "Low resolution blurred sample",
        "No action required",
    ], size=(360, 250), blur=True)
    # PIL can save a simple local PDF from a synthetic page. This is non-private.
    pdf_img = Image.open(SAMPLES / "synthetic_invoice.png").convert("RGB")
    pdf_img.save(SAMPLES / "synthetic_invoice.pdf", "PDF")
    (SAMPLES / "synthetic_invoice.pdf.txt").write_text((SAMPLES / "synthetic_invoice.png.txt").read_text())
    print(f"wrote samples under {SAMPLES}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
Binary file not shown (4.5 KiB).
@@ -0,0 +1,2 @@
Low resolution blurred sample
No action required
Binary file not shown (9.1 KiB).
@@ -0,0 +1,3 @@
Messages with Alex
Can you please respond by tomorrow?
Need signature on the form before Friday.
Binary file not shown.
@@ -0,0 +1,5 @@
ACME Utilities Invoice
Invoice No: INV-2026-0604
Amount Due: $123.45
Payment due 2026-06-30
Please submit payment by the due date.
Binary file not shown (13 KiB).
@@ -0,0 +1,5 @@
ACME Utilities Invoice
Invoice No: INV-2026-0604
Amount Due: $123.45
Payment due 2026-06-30
Please submit payment by the due date.
Binary file not shown (12 KiB).
@@ -0,0 +1,5 @@
Neighborhood Store Receipt
Subtotal $14.20
Tax $1.42
Total $15.62
Thank you for shopping
Binary file not shown (12 KiB).
@@ -0,0 +1,5 @@
Sample Government Form - Fake Data
Applicant: Test Person
SSN: 123-45-6789
Signature required
Submit by Jan 15, 2027
@@ -0,0 +1,196 @@
#!/usr/bin/env python3
"""Stdlib localhost HTTP wrapper for the triage prototype.

Endpoints:
- GET /healthz
- GET /models
- POST /triage JSON: {"path":"/local/file", "options": {...}}
- POST /triage/batch JSON: {"paths":["/local/file"], "options": {...}}

The server binds to 127.0.0.1 by default and accepts only local file paths under
configured allowed roots. It never uploads document/image contents externally.
"""
from __future__ import annotations

import argparse
import ipaddress
import json
import os
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
from typing import Any
from urllib.parse import urlparse

from triage import DEFAULT_EMBED_URL, TriageOptions, read_npu_busy, triage_batch, triage_file


def _validate_loopback_host(host: str) -> str:
    """Reject non-loopback binds; this prototype is never a LAN service."""
    normalized = host.strip()
    if normalized == "localhost":
        return normalized
    try:
        if ipaddress.ip_address(normalized).is_loopback:
            return normalized
    except ValueError:
        pass
    raise ValueError("host must be localhost/loopback for this prototype")


def _roots_within_configured(requested_roots: list[Any], configured_roots: list[Path]) -> list[Path]:
    """Return request roots only when they narrow the startup allowlist."""
    narrowed: list[Path] = []
    configured = [root.expanduser().resolve() for root in configured_roots]
    for raw in requested_roots:
        candidate = Path(str(raw)).expanduser().resolve()
        if any(candidate == root or candidate.is_relative_to(root) for root in configured):
            narrowed.append(candidate)
        else:
            raise ValueError("requested allowed_roots must be within configured allowed roots")
    return narrowed


def _validated_embedding_url(raw_url: Any) -> str:
    """Allow only the configured local loopback embeddings service."""
    url = str(raw_url)
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if (
        parsed.scheme == "http"
        and host in {"127.0.0.1", "localhost", "::1"}
        and (parsed.port or 80) == 18817
        and parsed.path == "/v1/embeddings"
        and not parsed.username
        and not parsed.password
    ):
        return url
    raise ValueError("embedding_url override must target the configured local loopback embeddings service")


def make_options(payload: dict[str, Any], default_roots: list[Path]) -> TriageOptions:
    opts = payload.get("options") or {}
    requested_roots = opts.get("allowed_roots", [])
    if requested_roots:
        if not isinstance(requested_roots, list):
            raise ValueError("allowed_roots must be a list")
        roots = _roots_within_configured(requested_roots, default_roots)
    else:
        roots = default_roots
    embedding_url = DEFAULT_EMBED_URL
    if "embedding_url" in opts:
        embedding_url = _validated_embedding_url(opts["embedding_url"])
    return TriageOptions(
        max_pages=int(opts.get("max_pages", 3)),
        include_ocr_text=bool(opts.get("include_ocr_text", False)),
        dry_run=bool(opts.get("dry_run", False)),
        use_embeddings=bool(opts.get("use_embeddings", True)),
        embedding_url=embedding_url,
        allowed_roots=roots,
        include_full_path=bool(opts.get("include_full_path", False)),
    )


class Handler(BaseHTTPRequestHandler):
    server_version = "openvino-doc-image-triage-npu/0.1"

    def _json(self, status: int, body: dict[str, Any]) -> None:
        data = json.dumps(body, sort_keys=True).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format: str, *args: Any) -> None:
        # Do not log request bodies, OCR text, or file paths.
        return

    @property
    def allowed_roots(self) -> list[Path]:
        return self.server.allowed_roots  # type: ignore[attr-defined]

    def do_GET(self) -> None:  # noqa: N802
        if self.path in ("/", "/healthz", "/health"):
            self._json(200, {
                "ok": True,
                "service": "openvino-doc-image-triage-npu",
                "bind_policy": "localhost-default",
                "npu_busy_time_us": read_npu_busy(),
                "npu_busy_check_enabled": True,
                "allowed_roots": [str(p) for p in self.allowed_roots],
                "privacy": {"external_uploads": False, "raw_text_logged": False},
            })
            return
        if self.path == "/models":
            self._json(200, {
                "models": [
                    {
                        "stage": "needs_attention_embedding",
                        "model": "bge-base-en-v1.5-int8-ov via local :18817",
                        "target_device": "NPU",
                        "verification": "sysfs npu_busy_time_us before/after embedding call",
                    },
                    {
                        "stage": "image_category_classification",
                        "model": "rule-based fallback in prototype v1",
                        "target_device": "CPU",
                        "npu_status": "not configured; future static-shape MobileNet/EfficientNet/ResNet OV IR",
                    },
                    {"stage": "ocr_text_extraction", "model": "optional local sidecar/PDF text", "target_device": "CPU"},
                ]
            })
            return
        self._json(404, {"ok": False, "error": "not_found"})

    def _read_payload(self) -> dict[str, Any]:
        length = int(self.headers.get("Content-Length", "0"))
        if length < 0:
            # A negative length would make rfile.read() block until EOF.
            raise ValueError("invalid Content-Length")
        if length > 512 * 1024:
            raise ValueError("request JSON too large")
        raw = self.rfile.read(length)
        if not raw:
            return {}
        payload = json.loads(raw.decode())
        if not isinstance(payload, dict):
            raise ValueError("request JSON must be an object")
        return payload

    def do_POST(self) -> None:  # noqa: N802
        try:
            payload = self._read_payload()
            options = make_options(payload, self.allowed_roots)
            if self.path == "/triage":
                path = payload.get("path")
                if not path:
                    self._json(400, {"ok": False, "error": "missing_path"})
                    return
                self._json(200, {"ok": True, "result": triage_file(path, options)})
                return
            if self.path == "/triage/batch":
                paths = payload.get("paths") or []
                if not isinstance(paths, list) or not paths:
                    self._json(400, {"ok": False, "error": "missing_paths"})
                    return
                self._json(200, triage_batch([str(p) for p in paths], options))
                return
            self._json(404, {"ok": False, "error": "not_found"})
        except Exception as exc:
            self._json(400, {"ok": False, "error": type(exc).__name__, "message": str(exc)})


def main() -> int:
    parser = argparse.ArgumentParser(description="Local-only doc/image triage HTTP server")
    parser.add_argument("--host", default=os.environ.get("DOC_IMAGE_TRIAGE_HOST", "127.0.0.1"))
    parser.add_argument("--port", type=int, default=int(os.environ.get("DOC_IMAGE_TRIAGE_PORT", "18829")))
    parser.add_argument("--allowed-root", action="append", default=[], help="allowed local root; may repeat")
    args = parser.parse_args()
    try:
        host = _validate_loopback_host(args.host)
    except ValueError as exc:
        parser.error(str(exc))
    roots = [Path(p).expanduser().resolve() for p in args.allowed_root] or [Path.cwd().resolve()]
    server_cls = ThreadingHTTPServer
    if ":" in host:
        # An IPv6 loopback literal such as ::1 needs an AF_INET6 socket; the stdlib default is AF_INET.
        server_cls = type("IPv6ThreadingHTTPServer", (ThreadingHTTPServer,), {"address_family": socket.AF_INET6})
    httpd = server_cls((host, args.port), Handler)
    httpd.allowed_roots = roots  # type: ignore[attr-defined]
    print(json.dumps({"service": "openvino-doc-image-triage-npu", "host": host, "port": args.port, "allowed_roots": [str(p) for p in roots]}), flush=True)
    httpd.serve_forever()
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,154 @@
#!/usr/bin/env python3
from __future__ import annotations

import json
import socket
import subprocess
import sys
import tempfile
import time
import urllib.error
import urllib.request
from pathlib import Path

ROOT = Path(__file__).resolve().parents[1]
SAMPLES = ROOT / "samples"
BUSY = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=ROOT, check=True)


def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode())


def post_json_status(url: str, payload: dict) -> tuple[int, dict]:
    req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, json.loads(resp.read().decode())
    except urllib.error.HTTPError as exc:
        return exc.code, json.loads(exc.read().decode())


def busy() -> int | None:
    try:
        return int(BUSY.read_text().strip())
    except Exception:
        return None


def choose_free_loopback_port() -> int:
    """Ask the OS for a free localhost port and verify it is not listening yet."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        port = int(sock.getsockname()[1])
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.25)
        assert probe.connect_ex(("127.0.0.1", port)) != 0, f"selected port already has a listener: {port}"
    return port


def assert_loopback_bind_policy() -> None:
    blocked = subprocess.run(
        [sys.executable, "server.py", "--host", "0.0.0.0", "--port", "0", "--allowed-root", str(ROOT)],
        cwd=ROOT,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    assert blocked.returncode != 0, blocked.stdout + blocked.stderr
    assert "loopback" in blocked.stderr.lower(), blocked.stderr


def main() -> int:
    run([sys.executable, "make_samples.py"])
    invoice = SAMPLES / "synthetic_invoice.png"
    pdf = SAMPLES / "synthetic_invoice.pdf"

    before = busy()
    raw = subprocess.check_output([
        sys.executable, "triage.py", "--allowed-root", str(ROOT), "--pretty", str(invoice), str(pdf)
    ], cwd=ROOT, text=True)
    data = json.loads(raw)
    assert data["ok"], data
    first = data["files"][0]["result"]
    assert first["privacy"]["external_uploads"] is False
    assert first["pages"][0]["classification"]["label"] == "bill_or_invoice"
    assert first["pages"][0]["needs_attention"]["value"] is True
    assert "amount_due" in first["pages"][0]["needs_attention"]["reasons"]
    assert first["processing_device_summary"]["file_intake"] == "CPU"
    assert "NPU" in first["processing_device_summary"]["needs_attention_embedding"] or first["pages"][0]["needs_attention"]["device"] == "CPU"
    after = busy()
    if before is not None and after is not None:
        # If :18817 is reachable and text was embedded, NPU delta must be positive.
        emb = first["pages"][0]["needs_attention"]["embedding"]
        if emb.get("used"):
            assert emb.get("verified_npu") is True, emb
            assert (emb.get("npu_busy_delta_us") or 0) > 0, emb
            assert after > before, {"before": before, "after": after, "embedding": emb}

    # HTTP smoke on a preflighted free localhost port so we do not collide with live/prototype ports.
    assert_loopback_bind_policy()
    smoke_port = choose_free_loopback_port()
    base_url = f"http://127.0.0.1:{smoke_port}"
    proc = subprocess.Popen([sys.executable, "server.py", "--host", "127.0.0.1", "--port", str(smoke_port), "--allowed-root", str(ROOT)], cwd=ROOT, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    try:
        deadline = time.time() + 5
        while time.time() < deadline:
            try:
                health = urllib.request.urlopen(f"{base_url}/healthz", timeout=1).read()
                assert b"openvino-doc-image-triage-npu" in health
                break
            except Exception:
                time.sleep(0.1)
        else:
            raise AssertionError("server did not become ready")
        resp = post_json(f"{base_url}/triage", {"path": str(invoice), "options": {"allowed_roots": [str(ROOT)]}})
        assert resp["ok"] is True, resp
        assert resp["result"]["source_path_basename"] == "synthetic_invoice.png"
        assert "source_path" not in resp["result"]
# Request bodies may narrow but must not widen the startup --allowed-root policy.
|
||||
with tempfile.NamedTemporaryFile(suffix=".txt") as outside:
|
||||
outside.write(b"sensitive text outside configured artifact root")
|
||||
outside.flush()
|
||||
status, blocked = post_json_status(
|
||||
f"{base_url}/triage",
|
||||
{"path": outside.name, "options": {"allowed_roots": ["/tmp"], "dry_run": True, "use_embeddings": False}},
|
||||
)
|
||||
assert status == 400, blocked
|
||||
assert blocked["ok"] is False, blocked
|
||||
assert "allowed_roots" in blocked.get("message", ""), blocked
|
||||
|
||||
# Request bodies must not redirect extracted text to caller-supplied endpoints.
|
||||
status, blocked = post_json_status(
|
||||
f"{base_url}/triage",
|
||||
{"path": str(invoice), "options": {"embedding_url": "http://198.51.100.1:9/v1/embeddings"}},
|
||||
)
|
||||
assert status == 400, blocked
|
||||
assert blocked["ok"] is False, blocked
|
||||
assert "embedding_url" in blocked.get("message", ""), blocked
|
||||
finally:
|
||||
proc.terminate()
|
||||
proc.wait(timeout=5)
|
||||
|
||||
print(json.dumps({
|
||||
"ok": True,
|
||||
"samples": len(list(SAMPLES.glob("synthetic_*"))),
|
||||
"npu_busy_before": before,
|
||||
"npu_busy_after": after,
|
||||
"npu_delta_observed": None if before is None or after is None else after - before,
|
||||
"triage_label": first["pages"][0]["classification"]["label"],
|
||||
"needs_attention": first["pages"][0]["needs_attention"]["value"],
|
||||
}, indent=2))
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
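The smoke test above requires that request bodies may narrow but must not widen the startup `--allowed-root` policy. The `server.py` code that enforces this is not shown here, so the following is only a minimal sketch of one way such a check could work. The helper name `narrow_allowed_roots` and the error wording are assumptions, chosen so the message contains the `allowed_roots` string the smoke test looks for.

```python
from pathlib import Path
from typing import Optional


def narrow_allowed_roots(startup_roots: list[Path], requested: Optional[list[str]]) -> list[Path]:
    """Return the roots a request may use without widening the startup policy.

    Hypothetical helper: every requested root must resolve to a startup root
    or to a directory beneath one; otherwise the request is rejected.
    """
    startup = [root.resolve() for root in startup_roots]
    if not requested:
        return startup  # no override requested: keep the startup policy as-is
    narrowed = []
    for raw in requested:
        candidate = Path(raw).resolve()
        inside = any(candidate == root or root in candidate.parents for root in startup)
        if not inside:
            # Report only the basename so the error does not echo a full path.
            raise ValueError(f"allowed_roots may not widen the startup policy: {candidate.name}")
        narrowed.append(candidate)
    return narrowed
```

Resolving both sides before comparing means a request cannot escape the startup roots through `..` segments or symlinks, which is the same approach `under_allowed_root` takes for individual file paths.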
@@ -0,0 +1,459 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Local-only document/image triage prototype.
|
||||
|
||||
CPU stages:
|
||||
- local file intake, hashing, MIME/extension checks
|
||||
- image/PDF-page decoding and normalization
|
||||
- optional sidecar/native-text extraction
|
||||
- regex metadata extraction and rule-based category fallback
|
||||
|
||||
NPU stages:
|
||||
- needs-attention semantic embedding via the existing local OpenVINO NPU
|
||||
embeddings service on 127.0.0.1:18817, verified by sysfs busy-time delta.
|
||||
|
||||
No external uploads are performed. The only network call is localhost to the
|
||||
embedding service when enabled.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import base64
|
||||
import dataclasses
|
||||
import datetime as dt
|
||||
import hashlib
|
||||
import io
|
||||
import json
|
||||
import mimetypes
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import urllib.error
|
||||
import urllib.request
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
try:
|
||||
from PIL import Image, ImageOps
|
||||
except Exception as exc: # pragma: no cover - caught in CLI smoke
|
||||
raise SystemExit("Pillow is required: install pillow in the active Python env") from exc
|
||||
|
||||
NPU_BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
|
||||
DEFAULT_EMBED_URL = "http://127.0.0.1:18817/v1/embeddings"
|
||||
DEFAULT_ALLOWED_ROOTS = [Path.cwd()]
|
||||
MAX_FILE_BYTES = 25 * 1024 * 1024
|
||||
CATEGORY_LABELS = [
|
||||
"receipt",
|
||||
"bill_or_invoice",
|
||||
"tax_or_financial",
|
||||
"medical_or_insurance",
|
||||
"legal_or_government",
|
||||
"form_or_application",
|
||||
"travel_or_ticket",
|
||||
"screenshot_conversation",
|
||||
"screenshot_web_or_app",
|
||||
"identity_or_sensitive",
|
||||
"photo_misc",
|
||||
"unknown_or_low_confidence",
|
||||
]
|
||||
|
||||
DATE_PATTERNS = [
|
||||
re.compile(r"\b(20\d{2}[-/](?:0?[1-9]|1[0-2])[-/](?:0?[1-9]|[12]\d|3[01]))\b"),
|
||||
re.compile(r"\b((?:0?[1-9]|1[0-2])[-/](?:0?[1-9]|[12]\d|3[01])[-/](?:20)?\d{2})\b"),
|
||||
re.compile(r"\b((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{1,2},?\s+20\d{2})\b", re.I),
|
||||
]
|
||||
AMOUNT_RE = re.compile(r"(?<!\w)(?:USD\s*)?\$\s?\d{1,4}(?:,\d{3})*(?:\.\d{2})?\b", re.I)
|
||||
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}\b")
|
||||
PHONE_RE = re.compile(r"\b(?:\+?1[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?){2}\d{4}\b")
|
||||
ACCOUNT_RE = re.compile(r"\b(?:account|acct|policy|invoice|member|case|claim)\s*(?:#|no\.?|id)?\s*[:\-]?\s*[A-Z0-9-]{4,}\b", re.I)
|
||||
SSN_LIKE_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
|
||||
|
||||
ATTENTION_KEYWORDS = {
|
||||
"due_date_present": ["due", "payment due", "pay by", "deadline"],
|
||||
"amount_due": ["amount due", "balance due", "total due", "$"],
|
||||
"action_required_language": ["action required", "please respond", "complete", "submit", "renew", "verify"],
|
||||
"signature_required": ["signature", "sign and return", "signed"],
|
||||
"appointment_or_deadline": ["appointment", "scheduled", "reservation", "hearing"],
|
||||
"account_security": ["security", "password", "unauthorized", "fraud", "verify your account"],
|
||||
"medical_followup": ["follow up", "lab result", "referral", "insurance"],
|
||||
"tax_deadline": ["irs", "tax", "1099", "w-2", "deadline"],
|
||||
}
|
||||
|
||||
CATEGORY_KEYWORDS = {
|
||||
"receipt": ["receipt", "subtotal", "cashier", "change", "store"],
|
||||
"bill_or_invoice": ["invoice", "amount due", "balance due", "statement", "payment due"],
|
||||
"tax_or_financial": ["tax", "irs", "1099", "w-2", "bank", "routing"],
|
||||
"medical_or_insurance": ["medical", "insurance", "clinic", "patient", "claim"],
|
||||
"legal_or_government": ["court", "government", "department", "notice", "license"],
|
||||
"form_or_application": ["application", "form", "signature", "submit"],
|
||||
"travel_or_ticket": ["boarding", "ticket", "itinerary", "reservation", "gate"],
|
||||
"screenshot_conversation": ["message", "chat", "reply", "conversation"],
|
||||
"screenshot_web_or_app": ["login", "browser", "app", "settings", "dashboard"],
|
||||
"identity_or_sensitive": ["ssn", "passport", "driver license", "social security"],
|
||||
}
|
||||
|
||||
|
||||
@dataclasses.dataclass
|
||||
class TriageOptions:
|
||||
max_pages: int = 3
|
||||
include_ocr_text: bool = False
|
||||
dry_run: bool = False
|
||||
use_embeddings: bool = True
|
||||
embedding_url: str = DEFAULT_EMBED_URL
|
||||
allowed_roots: list[Path] = dataclasses.field(default_factory=lambda: DEFAULT_ALLOWED_ROOTS.copy())
|
||||
include_full_path: bool = False
|
||||
timeout_seconds: float = 10.0
|
||||
|
||||
|
||||
def read_npu_busy() -> int | None:
|
||||
try:
|
||||
return int(NPU_BUSY_PATH.read_text().strip())
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def sha256_file(path: Path) -> str:
|
||||
h = hashlib.sha256()
|
||||
with path.open("rb") as f:
|
||||
for chunk in iter(lambda: f.read(1024 * 1024), b""):
|
||||
h.update(chunk)
|
||||
return h.hexdigest()
|
||||
|
||||
|
||||
def under_allowed_root(path: Path, roots: list[Path]) -> bool:
|
||||
resolved = path.resolve()
|
||||
for root in roots:
|
||||
try:
|
||||
resolved.relative_to(root.resolve())
|
||||
return True
|
||||
except ValueError:
|
||||
continue
|
||||
return False
|
||||
|
||||
|
||||
def sidecar_text(path: Path) -> tuple[str, str | None]:
|
||||
for suffix in (path.suffix + ".txt", ".txt"):
|
||||
candidate = path.with_suffix(suffix)
|
||||
if candidate.exists() and candidate.is_file():
|
||||
try:
|
||||
return candidate.read_text(errors="replace")[:12000], f"sidecar:{candidate.name}"
|
||||
except Exception:
|
||||
return "", "sidecar_unreadable"
|
||||
return "", None
|
||||
|
||||
|
||||
def extract_pdf_text(path: Path, max_pages: int) -> tuple[str, str | None]:
|
||||
# Optional dependency; tests do not require it. Keeps PDF support local-only when installed.
|
||||
try:
|
||||
import pypdf # type: ignore
|
||||
except Exception:
|
||||
return "", "pypdf_not_installed"
|
||||
try:
|
||||
reader = pypdf.PdfReader(str(path))
|
||||
if getattr(reader, "is_encrypted", False):
|
||||
return "", "pdf_encrypted"
|
||||
chunks = []
|
||||
for page in reader.pages[:max_pages]:
|
||||
chunks.append(page.extract_text() or "")
|
||||
return "\n".join(chunks)[:12000], "pypdf_cpu"
|
||||
except Exception as exc:
|
||||
return "", f"pdf_text_error:{type(exc).__name__}"
|
||||
|
||||
|
||||
def load_image_pages(path: Path, max_pages: int) -> tuple[list[Image.Image], str | None]:
|
||||
ext = path.suffix.lower()
|
||||
if ext == ".pdf":
|
||||
try:
|
||||
import pypdfium2 as pdfium # type: ignore
|
||||
except Exception:
|
||||
return [], "pypdfium2_not_installed"
|
||||
try:
|
||||
pdf = pdfium.PdfDocument(str(path))
|
||||
pages = []
|
||||
for i in range(min(len(pdf), max_pages)):
|
||||
bitmap = pdf[i].render(scale=1.5)
|
||||
pages.append(bitmap.to_pil().convert("RGB"))
|
||||
return pages, None
|
||||
except Exception as exc:
|
||||
return [], f"pdf_render_error:{type(exc).__name__}"
|
||||
try:
|
||||
img = Image.open(path)
|
||||
img = ImageOps.exif_transpose(img).convert("RGB")
|
||||
return [img], None
|
||||
except Exception as exc:
|
||||
return [], f"image_decode_error:{type(exc).__name__}"
|
||||
|
||||
|
||||
def normalize_for_hash_features(img: Image.Image) -> dict[str, Any]:
|
||||
small = ImageOps.contain(img.copy(), (224, 224))
|
||||
gray = small.convert("L")
|
||||
hist = gray.histogram()
|
||||
pixels = max(1, gray.width * gray.height)
|
||||
mean = sum(i * c for i, c in enumerate(hist)) / pixels
|
||||
variance = sum(((i - mean) ** 2) * c for i, c in enumerate(hist)) / pixels
|
||||
return {
|
||||
"mean_luma": round(mean, 2),
|
||||
"contrast": round(variance ** 0.5, 2),
|
||||
"aspect_ratio": round(img.width / max(1, img.height), 3),
|
||||
}
|
||||
|
||||
|
||||
def classify_rule(text: str, image_features: dict[str, Any]) -> dict[str, Any]:
|
||||
t = text.lower()
|
||||
best_label = "unknown_or_low_confidence"
|
||||
best_score = 0
|
||||
for label, words in CATEGORY_KEYWORDS.items():
|
||||
score = sum(1 for word in words if word in t)
|
||||
if score > best_score:
|
||||
best_label, best_score = label, score
|
||||
if best_score == 0:
|
||||
ar = image_features.get("aspect_ratio", 1.0)
|
||||
if ar > 1.3:
|
||||
best_label, best_score = "screenshot_web_or_app", 1
|
||||
else:
|
||||
best_label, best_score = "unknown_or_low_confidence", 0
|
||||
confidence = min(0.35 + 0.18 * best_score, 0.92) if best_score else 0.2
|
||||
if confidence < 0.45:
|
||||
best_label = "unknown_or_low_confidence"
|
||||
return {
|
||||
"label": best_label,
|
||||
"confidence": round(confidence, 3),
|
||||
"device": "CPU",
|
||||
"stage": "category_classification",
|
||||
"method": "rule_based_fallback",
|
||||
"npu_status": "not_configured_for_prototype_v1",
|
||||
"candidate_labels": CATEGORY_LABELS,
|
||||
}
|
||||
|
||||
|
||||
def extract_metadata(text: str) -> dict[str, Any]:
|
||||
dates = []
|
||||
for pat in DATE_PATTERNS:
|
||||
dates.extend(m.group(1) for m in pat.finditer(text))
|
||||
amounts = AMOUNT_RE.findall(text)
|
||||
flags = {
|
||||
"org_present": bool(re.search(r"\b(?:inc|llc|clinic|department|bank|insurance|store)\b", text, re.I)),
|
||||
"address_present": bool(re.search(r"\b\d{2,5}\s+[A-Za-z0-9 .]+\s+(?:st|street|ave|avenue|rd|road|blvd|drive|dr)\b", text, re.I)),
|
||||
"phone_present": bool(PHONE_RE.search(text)),
|
||||
"email_present": bool(EMAIL_RE.search(text)),
|
||||
"policy_or_account_id_present": bool(ACCOUNT_RE.search(text)),
|
||||
"identity_number_like_present": bool(SSN_LIKE_RE.search(text)),
|
||||
}
|
||||
return {
|
||||
"dates_count": len(set(dates)),
|
||||
"amounts_count": len(set(amounts)),
|
||||
"detected_entities": flags,
|
||||
"raw_values_redacted": True,
|
||||
}
|
||||
|
||||
|
||||
def call_embeddings(text: str, url: str, timeout: float) -> dict[str, Any]:
|
||||
if not text.strip():
|
||||
return {"used": False, "device": "NPU", "status": "skipped_no_text", "npu_busy_delta_us": 0, "verified_npu": False}
|
||||
before = read_npu_busy()
|
||||
payload = json.dumps({"input": text[:2048], "purpose": "document"}).encode()
|
||||
req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
|
||||
t0 = time.perf_counter()
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
||||
body = resp.read(1024 * 1024)
|
||||
status = resp.status
|
||||
parsed = json.loads(body.decode())
|
||||
dim = None
|
||||
if isinstance(parsed, dict) and parsed.get("data"):
|
||||
emb = parsed["data"][0].get("embedding", [])
|
||||
dim = len(emb) if isinstance(emb, list) else None
|
||||
after = read_npu_busy()
|
||||
delta = (after - before) if before is not None and after is not None else None
|
||||
return {
|
||||
"used": True,
|
||||
"device": "NPU",
|
||||
"status": "ok" if status == 200 else f"http_{status}",
|
||||
"embedding_dim": dim,
|
||||
"wall_ms": round((time.perf_counter() - t0) * 1000, 2),
|
||||
"npu_busy_delta_us": delta,
|
||||
"verified_npu": bool(delta and delta > 0),
|
||||
"endpoint": "127.0.0.1:18817",
|
||||
}
|
||||
except (OSError, ValueError) as exc:  # URLError/TimeoutError/socket errors are OSError; JSON and UTF-8 decode errors are ValueError
|
||||
after = read_npu_busy()
|
||||
delta = (after - before) if before is not None and after is not None else None
|
||||
return {
|
||||
"used": False,
|
||||
"device": "NPU",
|
||||
"status": f"embedding_service_error:{type(exc).__name__}",
|
||||
"npu_busy_delta_us": delta,
|
||||
"verified_npu": False,
|
||||
"endpoint": "127.0.0.1:18817",
|
||||
}
|
||||
|
||||
|
||||
def needs_attention(text: str, embedding_result: dict[str, Any]) -> dict[str, Any]:
|
||||
t = text.lower()
|
||||
reasons = []
|
||||
for reason, words in ATTENTION_KEYWORDS.items():
|
||||
if any(word in t for word in words):
|
||||
reasons.append(reason)
|
||||
meta = extract_metadata(text)
|
||||
if meta["amounts_count"]:
|
||||
reasons.append("amount_due")
|
||||
if meta["dates_count"]:
|
||||
reasons.append("due_date_present")
|
||||
reasons = sorted(set(reasons))
|
||||
value = bool(reasons)
|
||||
confidence = min(0.45 + 0.1 * len(reasons), 0.9) if value else 0.35
|
||||
if embedding_result.get("verified_npu"):
|
||||
confidence = min(confidence + 0.05, 0.95)
|
||||
return {
|
||||
"value": value,
|
||||
"confidence": round(confidence, 3),
|
||||
"reasons": reasons or (["low_confidence"] if not text.strip() else []),
|
||||
"device": "NPU+CPU" if embedding_result.get("used") else "CPU",
|
||||
"stage": "needs_attention",
|
||||
"method": "NPU embedding verification + CPU rules" if embedding_result.get("used") else "CPU rules fallback",
|
||||
"embedding": embedding_result,
|
||||
}
|
||||
|
||||
|
||||
def infer_media_type(path: Path, is_pdf_page: bool = False) -> str:
|
||||
if is_pdf_page:
|
||||
return "pdf_page"
|
||||
mt, _ = mimetypes.guess_type(path.name)
|
||||
if path.suffix.lower() == ".pdf":
|
||||
return "pdf"
|
||||
if mt and mt.startswith("image/"):
|
||||
return "image"
|
||||
return "unknown"
|
||||
|
||||
|
||||
def triage_file(path_like: str | Path, options: TriageOptions | None = None) -> dict[str, Any]:
|
||||
options = options or TriageOptions()
|
||||
path = Path(path_like).expanduser()
|
||||
resolved = path.resolve()
|
||||
if not under_allowed_root(resolved, options.allowed_roots):
|
||||
raise ValueError(f"path is outside allowed roots: {path if options.include_full_path else path.name}")
|
||||
if not resolved.exists() or not resolved.is_file():
|
||||
raise FileNotFoundError(str(path) if options.include_full_path else path.name)
|
||||
size = resolved.stat().st_size
|
||||
if size > MAX_FILE_BYTES:
|
||||
raise ValueError(f"file too large for prototype limit: {size} bytes")
|
||||
|
||||
file_hash = sha256_file(resolved)
|
||||
text, text_source = sidecar_text(resolved)
|
||||
pdf_text_status = None
|
||||
if resolved.suffix.lower() == ".pdf" and not text:
|
||||
text, pdf_text_status = extract_pdf_text(resolved, options.max_pages)
|
||||
text_source = pdf_text_status
|
||||
|
||||
pages: list[dict[str, Any]] = []
|
||||
render_error = None
|
||||
if not options.dry_run:
|
||||
images, render_error = load_image_pages(resolved, options.max_pages)
|
||||
else:
|
||||
images = []
|
||||
|
||||
# images stays empty on a dry run or when rendering is unavailable; a file-level record is still returned below.
|
||||
|
||||
embedding_result = call_embeddings(text, options.embedding_url, options.timeout_seconds) if options.use_embeddings else {"used": False, "device": "NPU", "status": "disabled", "npu_busy_delta_us": 0, "verified_npu": False}
|
||||
attn = needs_attention(text, embedding_result)
|
||||
meta = extract_metadata(text)
|
||||
|
||||
if images:
|
||||
for idx, img in enumerate(images):
|
||||
features = normalize_for_hash_features(img)
|
||||
classification = classify_rule(text, features)
|
||||
pages.append({
|
||||
"page_index": idx,
|
||||
"media_type": infer_media_type(resolved, resolved.suffix.lower() == ".pdf"),
|
||||
"image": {"width": img.width, "height": img.height, "orientation": "portrait" if img.height >= img.width else "landscape", **features},
|
||||
"classification": classification,
|
||||
"needs_attention": attn,
|
||||
"metadata": meta,
|
||||
"ocr": {"available": bool(text), "quality": 0.7 if text else 0.0, "device": "CPU", "text_source": text_source},
|
||||
})
|
||||
else:
|
||||
classification = classify_rule(text, {"aspect_ratio": 1.0})
|
||||
pages.append({
|
||||
"page_index": 0,
|
||||
"media_type": infer_media_type(resolved, resolved.suffix.lower() == ".pdf"),
|
||||
"image": {"width": None, "height": None, "orientation": None, "render_error": render_error},
|
||||
"classification": classification,
|
||||
"needs_attention": attn,
|
||||
"metadata": meta,
|
||||
"ocr": {"available": bool(text), "quality": 0.7 if text else 0.0, "device": "CPU", "text_source": text_source},
|
||||
})
|
||||
|
||||
result: dict[str, Any] = {
|
||||
"file_id": f"sha256:{file_hash}",
|
||||
"source_path_basename": resolved.name,
|
||||
"media_type": infer_media_type(resolved),
|
||||
"file_size_bytes": size,
|
||||
"page_count": len(pages),
|
||||
"pages": pages,
|
||||
"processing_device_summary": {
|
||||
"file_intake": "CPU",
|
||||
"pdf_rendering": "CPU" if resolved.suffix.lower() == ".pdf" else "not_applicable",
|
||||
"image_category_classification": "CPU rule fallback (NPU model not configured in prototype v1)",
|
||||
"ocr_text_extraction": "CPU/local sidecar or optional local PDF text extractor",
|
||||
"needs_attention_embedding": "NPU via local :18817" if embedding_result.get("used") else "CPU fallback/no text",
|
||||
"metadata_extraction": "CPU",
|
||||
"npu_verified": bool(embedding_result.get("verified_npu")),
|
||||
"npu_busy_delta_us": embedding_result.get("npu_busy_delta_us"),
|
||||
},
|
||||
"privacy": {
|
||||
"external_uploads": False,
|
||||
"localhost_only_embedding_call": bool(options.use_embeddings),
|
||||
"raw_text_logged": False,
|
||||
"raw_values_redacted": True,
|
||||
"full_path_included": options.include_full_path,
|
||||
},
|
||||
"errors": [e for e in [render_error, pdf_text_status if pdf_text_status and not text else None] if e],
|
||||
}
|
||||
if options.include_full_path:
|
||||
result["source_path"] = str(resolved)
|
||||
if options.include_ocr_text:
|
||||
result["ocr_text"] = text
|
||||
return result
|
||||
|
||||
|
||||
def triage_batch(paths: list[str], options: TriageOptions | None = None) -> dict[str, Any]:
|
||||
items = []
|
||||
for p in paths:
|
||||
try:
|
||||
items.append({"ok": True, "result": triage_file(p, options)})
|
||||
except Exception as exc:
|
||||
items.append({"ok": False, "source_path_basename": Path(p).name, "error": type(exc).__name__, "message": str(exc)})
|
||||
return {"ok": all(item["ok"] for item in items), "files": items, "generated_at": dt.datetime.now(dt.timezone.utc).isoformat()}
|
||||
|
||||
|
||||
def cli() -> int:
|
||||
parser = argparse.ArgumentParser(description="Local document/image triage prototype")
|
||||
parser.add_argument("paths", nargs="+", help="local image/PDF paths")
|
||||
parser.add_argument("--allowed-root", action="append", default=[], help="allowed local root; defaults to cwd")
|
||||
parser.add_argument("--max-pages", type=int, default=3)
|
||||
parser.add_argument("--include-ocr-text", action="store_true")
|
||||
parser.add_argument("--include-full-path", action="store_true")
|
||||
parser.add_argument("--no-embeddings", action="store_true", help="disable local NPU embedding call")
|
||||
parser.add_argument("--dry-run", action="store_true")
|
||||
parser.add_argument("--pretty", action="store_true")
|
||||
args = parser.parse_args()
|
||||
roots = [Path(p) for p in args.allowed_root] if args.allowed_root else [Path.cwd()]
|
||||
options = TriageOptions(
|
||||
max_pages=args.max_pages,
|
||||
include_ocr_text=args.include_ocr_text,
|
||||
dry_run=args.dry_run,
|
||||
use_embeddings=not args.no_embeddings,
|
||||
allowed_roots=roots,
|
||||
include_full_path=args.include_full_path,
|
||||
)
|
||||
out = triage_batch(args.paths, options)
|
||||
print(json.dumps(out, indent=2 if args.pretty else None, sort_keys=True))
|
||||
return 0 if out["ok"] else 2
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(cli())
|
||||
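`call_embeddings` brackets the HTTP request with two reads of the sysfs busy-time counter and treats a positive difference as evidence that the NPU did work. The same pattern can be sketched in isolation as a context manager. The names `read_counter` and `busy_delta` and the `counter_path` parameter are illustrative assumptions, not part of the prototype. Because the counter is read from a device-level sysfs file, unrelated NPU work running at the same time would presumably advance it too, so a positive delta is best read as evidence rather than proof.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional


def read_counter(counter_path: Path) -> Optional[int]:
    """Read an integer counter file; return None if it is missing or unparsable."""
    try:
        return int(counter_path.read_text().strip())
    except (OSError, ValueError):
        return None


@contextmanager
def busy_delta(counter_path: Path) -> Iterator[dict]:
    """Record how far the counter advanced while the with-block ran."""
    result = {"before": read_counter(counter_path), "after": None, "delta_us": None, "verified": False}
    try:
        yield result
    finally:
        # Take the second reading even if the wrapped work raised.
        result["after"] = read_counter(counter_path)
        if result["before"] is not None and result["after"] is not None:
            result["delta_us"] = result["after"] - result["before"]
            result["verified"] = result["delta_us"] > 0
```

A caller would wrap only the embedding request, for example `with busy_delta(NPU_BUSY_PATH) as npu: ...`, and then report `npu["delta_us"]` and `npu["verified"]` alongside the response. Keeping the measured window as narrow as possible reduces the chance that other NPU activity is attributed to the call.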