From fed8b629c7758a677925cb59e88112452aa5b9e2 Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Tue, 23 Sep 2025 10:15:09 -0700
Subject: [PATCH] chore: initialize repository scaffolding

---
 .gitignore      |  17 ++
 INSTRUCTIONS.md | 692 ++++++++++++++++++++++++++++++++++++++++++++++++
 README.md       |   3 +
 3 files changed, 712 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 INSTRUCTIONS.md
 create mode 100644 README.md

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..555103d
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,17 @@
+__pycache__/
+*.py[cod]
+*.egg-info/
+.env
+.venv/
+.env.*
+.pytest_cache/
+.coverage
+htmlcov/
+node_modules/
+.next/
+dist/
+build/
+.DS_Store
+.idea/
+.vscode/
+coverage/
diff --git a/INSTRUCTIONS.md b/INSTRUCTIONS.md
new file mode 100644
index 0000000..86b06d6
--- /dev/null
+++ b/INSTRUCTIONS.md
@@ -0,0 +1,692 @@
+# SPEC-1 – Classy Perplexity‑style News Aggregator (Raspberry Pi 5 K8s)

## Background

You want a Perplexity‑style web app that aggregates news from a defined pool of reference websites and presents results in a classy, attractive, highly responsive UI. The target runtime is a Raspberry Pi 5 Kubernetes cluster, so the system must be lightweight, ARM64‑friendly, and resilient to node churn or SD‑card fragility. The product should feel like a modern AI assistant for news discovery: fast search, crisp summaries, clear source attributions, and mobile‑first ergonomics.

Initial working assumptions (to be confirmed):

* Content sources are a curated list of reputable outlets and blogs that permit aggregation with proper linking and snippet‑length quoting.
* We will index headlines, metadata, and short excerpts; full‑text storage will be minimized or avoided unless licensed.
* The app will support semantic search + conversational Q\&A over the indexed corpus, with citations to original articles.
* Real‑time(ish) freshness target: new articles discoverable within 2–5 minutes of publication.
* UI aims to echo Perplexity’s clean card layout, with source badges, inline citations, and a composer panel for queries.
* Deployment must fit on 2–4 ARM64 nodes, using lightweight containers and a small replicated datastore.

## Requirements

**Scope for MVP**: Start with **Reuters** as the single source. Use official **RSS/Atom feeds and daily sitemaps** when available; gracefully fall back to HTML scraping for sections without feeds, storing only metadata/snippets with links. Freshness target 2–5 minutes. UI mirrors Perplexity’s card+chat layout with inline citations.

### MoSCoW

**Must‑have**

* Aggregate from Reuters via RSS/Atom + sitemaps; fallback HTML scraper with robots.txt compliance toggle.
* ARM64‑ready containers deployable on Raspberry Pi 5 K8s (k0s, with k3s or MicroK8s as alternatives).
* Ingest pipeline with deduplication, canonical URL normalization, and rate‑limit/backoff.
* Index headlines, authors, timestamps, topics, short excerpt (<= 320 chars), and source URL.
* Full‑text search over stored fields; semantic search embeddings over titles+snippets.
* Summarization and on‑page Q\&A with **clear citations** to source URLs.
* Classy, responsive UI with Perplexity‑style query composer, results cards, and source badges.
* Observability: structured logs, basic metrics (ingest latency, queue depth, p95 response), and alerting.
* Legal safety rails: configurable snippet length, per‑domain robots policy, and kill‑switch per source.

**Should‑have**

* Topic taxonomy and tags (World, Business, Tech, etc.).
* Incremental sitemap polling (by date) + change‑list RSS polling with jitter to avoid burst load.
* Reader mode extraction (readability‑style) used **only for summarization** in memory, not stored.
* Caching layer (HTTP + summary cache) to keep Raspberry Pi costs low.
* Multi‑node HA for index and queue; rolling updates.

**Could‑have**

* User accounts for saved searches and daily digests.
* Multi‑source expansion via declarative YAML for new sites.
* Related‑story clustering and timeline views.
* Basic mobile PWA installability and offline read‑later for snippets.

**Won’t‑have (MVP)**

* Paywalled content bypassing or full‑text storage of copyrighted articles.
* Personalized recommendations or email digests.
* Editorial curation tooling beyond tags and pinning.

## Method

### High‑level architecture

```plantuml
@startuml
skinparam componentStyle rectangle
skinparam shadowing false
skinparam ArrowColor #888
skinparam DefaultFontName Inter

rectangle "k0s Cluster (ARM64 Raspberry Pi 5)" as K8S {
  node "Namespace: news" as NS {
    [Ingest Scheduler\n(CronJobs)] as SCHED
    [Feed+Sitemap Poller\n(FastAPI Worker)] as POLLER
    [HTML Scraper\n(Worker, Trafilatura)] as SCRAPER
    [Normalizer/Dedupe\n(Worker)] as NORM
    [Embedder\n(Worker -> OpenAI embeddings)] as EMBED
    [Summarizer\n(Worker -> OpenAI gpt-4o-mini)] as SUMM

    database "PostgreSQL + pgvector" as PG
    [Redis\n(Cache + Queue)] as REDIS

    [API Gateway\n(FastAPI)] as API
    [Web UI\n(Next.js, Tailwind, shadcn)] as WEB
  }
}

POLLER --> SCRAPER
SCRAPER --> NORM
NORM --> PG
EMBED --> PG
SUMM --> PG

SCHED --> POLLER
EMBED --> [OpenAI Embeddings API]
SUMM --> [OpenAI Chat Completions]

API --> PG
API --> REDIS
WEB --> API
@enduml
```

**Why these choices (MVP):**

* **Source**: Start with **Reuters** using news sitemaps (with pagination parameters) and RSS; where feeds don’t exist, scrape respectfully with robots awareness.
* **Storage**: **PostgreSQL + pgvector** keeps the stack compact (one DB for metadata, text search, and vectors). Postgres full‑text covers keyword search; pgvector powers semantic search.
* **Workers**: Python **FastAPI** workers using **Trafilatura** for robust article extraction and metadata parsing. **Redis** as the lightweight queue/cache (Dramatiq or RQ).
* **Summaries/Q\&A**: On‑demand summaries and answer synthesis via **gpt‑4o‑mini** with **inline citations**. Embeddings via **text‑embedding‑3‑small**. Both accessed through API keys/secrets in Kubernetes.
* **UI**: **Next.js 14 App Router**, Tailwind + shadcn for a Perplexity‑style, low‑latency interface.
* **k0s**: ARM64‑friendly. Use **nginx‑ingress** for HTTP routing, with optional **HAProxy Ingress** for TCP/advanced policies.
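
To make the extraction choice concrete, here is a minimal sketch of the in‑memory extraction step, assuming Trafilatura is installed; the helper name and return shape are illustrative, and only the metadata and snippet would ever be persisted:

```python
# Illustrative sketch only: fetch + extract in memory, keep metadata and a <=320-char snippet.
import hashlib
import trafilatura

SNIPPET_MAX = 320

def extract_for_summary(url: str) -> dict | None:
    downloaded = trafilatura.fetch_url(url)          # plain fetch; rate limiting lives elsewhere
    if downloaded is None:
        return None
    meta = trafilatura.extract_metadata(downloaded)  # title, author, date where present
    body = trafilatura.extract(downloaded, favor_precision=True, include_comments=False)
    if body is None:
        return None
    return {
        "title": meta.title if meta else None,
        "author": meta.author if meta else None,
        "published_at": meta.date if meta else None,
        "snippet": body[:SNIPPET_MAX],               # persisted
        "full_text": body,                           # summarization context only, never stored
        "url_hash": hashlib.sha256(url.encode()).hexdigest(),
    }
```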
+ +### Data model (PostgreSQL) + +```sql +-- Sources (static for MVP) +CREATE TABLE sources ( + id SERIAL PRIMARY KEY, + name TEXT NOT NULL UNIQUE, -- e.g., 'Reuters' + base_url TEXT NOT NULL, -- e.g., https://www.reuters.com + rss_urls TEXT[] NOT NULL DEFAULT '{}', + sitemap_urls TEXT[] NOT NULL DEFAULT '{}', + robots_txt TEXT, + enabled BOOLEAN NOT NULL DEFAULT true +); + +-- Raw fetch jobs (observability + retries) +CREATE TABLE fetch_jobs ( + id BIGSERIAL PRIMARY KEY, + source_id INT REFERENCES sources(id), + url TEXT NOT NULL, + kind TEXT NOT NULL CHECK (kind IN ('rss','sitemap','article')), + status TEXT NOT NULL CHECK (status IN ('queued','fetched','parsed','failed')), + http_status INT, + etag TEXT, + last_modified TIMESTAMPTZ, + attempts INT NOT NULL DEFAULT 0, + error TEXT, + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT now() +); +CREATE INDEX ON fetch_jobs (status, created_at); + +-- Canonical articles (no copyrighted full text stored) +CREATE TABLE articles ( + id BIGSERIAL PRIMARY KEY, + source_id INT REFERENCES sources(id) NOT NULL, + canonical_url TEXT NOT NULL, + url_hash BYTEA NOT NULL, -- SHA-256 of canonical_url + title TEXT NOT NULL, + author TEXT, + category TEXT, -- World, Business, Tech, etc. + published_at TIMESTAMPTZ, + fetched_at TIMESTAMPTZ NOT NULL DEFAULT now(), + snippet TEXT, -- <= 320 chars, from feed/lede + summary TEXT, -- model-generated abstract + image_url TEXT, + language TEXT DEFAULT 'en', + UNIQUE (source_id, url_hash) +); +CREATE INDEX ON articles (published_at DESC); +CREATE INDEX ON articles USING GIN (to_tsvector('english', coalesce(title,'') || ' ' || coalesce(snippet,''))); + +-- Embeddings for semantic search (title+snippet) +CREATE EXTENSION IF NOT EXISTS vector; +CREATE TABLE article_embeddings ( + article_id BIGINT PRIMARY KEY REFERENCES articles(id) ON DELETE CASCADE, + embedding vector(1536) -- dimension for text-embedding-3-small +); +CREATE INDEX ON article_embeddings USING ivfflat (embedding vector_cosine_ops); + +-- Tags and mapping (optional but handy) +CREATE TABLE tags ( + id SERIAL PRIMARY KEY, + name TEXT UNIQUE +); +CREATE TABLE article_tags ( + article_id BIGINT REFERENCES articles(id) ON DELETE CASCADE, + tag_id INT REFERENCES tags(id) ON DELETE CASCADE, + PRIMARY KEY (article_id, tag_id) +); +``` + +### Ingestion flow + +1. **Discovery** + +* Poll **RSS/Atom** endpoints with ETag/Last‑Modified to minimize bandwidth. +* Poll **news sitemaps** using incremental parameters (e.g., `from=` offsets when supported). Maintain per‑endpoint cursors. +* For sections without feeds, enqueue **HTML pages** discovered from site index pages (rate‑limited) and respect `robots.txt` (configurable). + +2. **Fetch & Extract** + +* HTTP client with retry + exponential backoff and per‑host concurrency caps (e.g., 2–4). Respect `Cache-Control` where present. +* Use **Trafilatura** with `favor_precision=true` to extract main content for **in‑memory summarization only**; do not persist full text. +* Generate a **canonical URL** (resolve redirects, strip tracking params) and compute `url_hash`. + +3. **Normalize & Deduplicate** + +* If `(source_id, url_hash)` exists, skip insert; else create `articles` row with metadata and **snippet** (<=320 chars). +* Classify category using rule‑based hints (URL path, RSS category) with a fallback lightweight classifier. + +4. **Summaries & Embeddings** + +* Create a short **summary** (60–90 words, neutral tone) with inline citation marker `[1]` → canonical URL. 
+* Compute **embedding** on `(title + " + " + snippet)` and upsert into `article_embeddings`. + +5. **Indexing & Cache** + +* Postgres GIN index supports keyword search; pgvector handles ANN semantic search. +* Cache hot queries and summaries in Redis for 5–15 minutes. + +### API design (FastAPI) + +* `GET /v1/search?q=&mode=hybrid&page=` — Hybrid search (keyword + vector rerank), returns cards with title, snippet, badges, and citations. +* `GET /v1/articles/{id}` — Metadata + summary. +* `POST /v1/ask` — Conversational answer over top‑k retrieved articles, always with citations. +* `POST /v1/feedback` — Thumbs up/down and optional comment. + +### UI flows (Next.js 14) + +* **Home**: Center composer, query suggestions, trending topics. +* **Results**: Perplexity‑style answer at top with source chips; below, cards for each cited article; sticky composer for follow‑ups. +* **Interactions**: Cmd/Ctrl‑K global search, `?` keyboard help, skeleton loaders, optimistic UI. + +### Kubernetes (k0s) deployment sketch + +* **Namespaces**: `news`, `news-observe`. +* **Ingress**: `nginx-ingress` for HTTPS; optional parallel **HAProxy Ingress** for TCP/advanced use. Certs via cert‑manager + DNS‑01 or HTTP‑01. +* **Deployments** (ARM64 images): + + * `api` (FastAPI, Uvicorn Gunicorn): 2 replicas, HPA on CPU 60% & p95 latency SLI. + * `web` (Next.js): 2 replicas, static export (optional) behind Node adapter. + * `worker` (ingest/summarize/embed): 2–4 replicas, separate queues for `poll`, `scrape`, `summ`, `embed`. + * `postgres` (Bitnami ARM64) with persistent volume; enable `pgvector` extension. + * `redis` (Bitnami ARM64) for cache/queue. +* **RBAC/Secrets**: Kubernetes Secrets for API keys; service accounts per deployment. +* **Resources (starting)**: api 200m/512Mi; web 100m/256Mi; worker 300m/1Gi; redis 50m/256Mi; postgres 250m/2Gi. +* **Autoscaling**: HPA + VPA recommendations; cluster metrics via kube‑metrics‑server. + +### Ranking & answer synthesis + +* **Hybrid search**: BM25 (Postgres full‑text) for recall → take top 50; compute cosine similarity on vectors → rerank → top 8. +* **Answer**: Prompt model with the top 6 snippets + titles and URLs; enforce **citation after each sentence** where evidence exists. Refuse to answer beyond source material. + +### Rate limiting & ethics + +* Per‑source QPS caps (e.g., 0.5–1 rps) and adaptive backoff. +* Honor robots.txt by default; switchable per your policy. Always link prominently to original. +* Snippets limited; no storage of full article text. + +## Implementation + +### 0) Repo layout + +``` +news-agg/ + apps/ + api/ # FastAPI (Python 3.11) + web/ # Next.js 14 UI + workers/ # poll/scrape/summarize/embed (FastAPI tasks + RQ/Dramatiq) + deploy/ + base/ # K8s Kustomize base (namespaces, RBAC, NetworkPolicies) + overlays/ + pi-prod/ + kustomization.yaml + postgres.yaml + redis.yaml + api.yaml + web.yaml + workers.yaml + cron-poller.yaml + ingress-nginx.yaml + ingress-haproxy.yaml (optional) + secrets.example.yaml + ops/ + helm-values/ + bitnami-postgresql.yaml + bitnami-redis.yaml + scripts/ + build.sh # multi-arch docker buildx + db_migrate.sql # tables + pgvector +``` + +### 1) Container images (ARM64) + +* **Python base**: `python:3.11-slim` + `uv`/`pip-tools`; compile wheels at build time. +* **Node**: `node:18-alpine` → `next build` then run with `node` or export static. +* Use **`docker buildx`** to produce `linux/arm64` images. Example: + +``` +docker buildx build --platform linux/arm64 -t registry/pi/news-api:0.1 -f apps/api/Dockerfile --push . 
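
# The same multi-arch pattern applies to the other images; tags and Dockerfile paths
# below are illustrative and assume a Dockerfile exists in each app directory.
docker buildx build --platform linux/arm64 -t registry/pi/news-web:0.1 -f apps/web/Dockerfile --push .
docker buildx build --platform linux/arm64 -t registry/pi/news-workers:0.1 -f apps/workers/Dockerfile --push .
# When cross-building from an x86_64 host, a builder with QEMU/binfmt emulation is assumed, e.g.:
# docker buildx create --name pi-builder --use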
+``` + +**apps/api/Dockerfile** (snippet) + +```Dockerfile +FROM python:3.11-slim +RUN apt-get update && apt-get install -y build-essential libpq-dev && rm -rf /var/lib/apt/lists/* +WORKDIR /app +COPY apps/api/pyproject.toml apps/api/uv.lock ./ +RUN pip install -U pip && pip install uv +RUN uv pip install --system -r requirements.txt || true +COPY apps/api/ . +CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"] +``` + +### 2) k0s cluster prep (once) + +* Install **nginx‑ingress** and (optionally) **HAProxy Ingress** via manifests/Helm. +* Install **cert-manager** for TLS if exposing publicly. +* Add **metrics‑server** for HPA and **KEDA** (optional) for queue-based scaling. + +### 3) Datastores + +**PostgreSQL (Bitnami, pgvector)** + +```yaml +# deploy/overlays/pi-prod/postgres.yaml +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: { name: pgdata, namespace: news } +spec: + accessModes: ["ReadWriteOnce"] + resources: { requests: { storage: 20Gi } } +--- +apiVersion: v1 +kind: ConfigMap +metadata: { name: pg-init, namespace: news } +data: + 00-init.sql: | + CREATE EXTENSION IF NOT EXISTS vector; + -- migrations applied by apps on startup too +--- +apiVersion: helm.cattle.io/v1 +kind: HelmChart +metadata: { name: pg, namespace: kube-system } +spec: + chart: oci://registry-1.docker.io/bitnamicharts/postgresql + targetNamespace: news + version: 15.x.x + valuesContent: | + image: + repository: bitnami/postgresql + tag: 15-debian-12 + primary: + extraVolumes: + - name: pg-init + configMap: { name: pg-init } + extraVolumeMounts: + - name: pg-init + mountPath: /docker-entrypoint-initdb.d + persistence: + existingClaim: pgdata + auth: + username: news + password: ${PG_PASSWORD} + database: news +``` + +**Redis (Bitnami)** + +```yaml +# deploy/overlays/pi-prod/redis.yaml +apiVersion: helm.cattle.io/v1 +kind: HelmChart +metadata: { name: redis, namespace: kube-system } +spec: + chart: oci://registry-1.docker.io/bitnamicharts/redis + targetNamespace: news + version: 18.x.x + valuesContent: | + architecture: standalone + auth: + enabled: false +``` + +### 4) Secrets & Config + +```yaml +# deploy/overlays/pi-prod/secrets.example.yaml (copy to secrets.yaml and fill) +apiVersion: v1 +kind: Secret +metadata: { name: app-secrets, namespace: news } +type: Opaque +data: + OPENAI_API_KEY: + APP_SIGNING_KEY: +--- +apiVersion: v1 +kind: ConfigMap +metadata: { name: app-config, namespace: news } +data: + SNIPPET_MAX: "320" + SOURCES: | + - name: Reuters + base_url: https://www.reuters.com + rss: + - https://www.reuters.com/rss/worldNews + sitemaps: + - https://www.reuters.com/sitemap_news.xml + robots_policy: honor + RANKING: "hybrid" +``` + +### 5) Workers (poll, scrape, summarize, embed) + +```yaml +# deploy/overlays/pi-prod/workers.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: { name: workers, namespace: news } +spec: + replicas: 3 + selector: { matchLabels: { app: workers } } + template: + metadata: { labels: { app: workers } } + spec: + containers: + - name: workers + image: registry/pi/news-workers:0.1 + envFrom: + - secretRef: { name: app-secrets } + - configMapRef: { name: app-config } + env: + - { name: REDIS_URL, value: redis://redis-master.news.svc.cluster.local:6379/0 } + - { name: DATABASE_URL, value: postgresql://news:$(PG_PASSWORD)@pg-postgresql.news.svc.cluster.local:5432/news } + resources: + requests: { cpu: "300m", memory: "1Gi" } + limits: { cpu: "900m", memory: "2Gi" } + livenessProbe: { httpGet: { path: /healthz, port: 8080 }, initialDelaySeconds: 15 } + 
readinessProbe:{ httpGet: { path: /readyz, port: 8080 }, initialDelaySeconds: 5 } +``` + +**Cron: feed/sitemap polling** + +```yaml +apiVersion: batch/v1 +kind: CronJob +metadata: { name: poller, namespace: news } +spec: + schedule: "*/2 * * * *" # every 2 minutes + jobTemplate: + spec: + template: + spec: + restartPolicy: OnFailure + containers: + - name: poll + image: registry/pi/news-workers:0.1 + args: ["poll"] + envFrom: + - secretRef: { name: app-secrets } + - configMapRef: { name: app-config } +``` + +### 6) API service (FastAPI) + +```yaml +# deploy/overlays/pi-prod/api.yaml +apiVersion: apps/v1 +kind: Deployment +metadata: { name: api, namespace: news } +spec: + replicas: 2 + selector: { matchLabels: { app: api } } + template: + metadata: { labels: { app: api } } + spec: + containers: + - name: api + image: registry/pi/news-api:0.1 + ports: [{ containerPort: 8080 }] + envFrom: + - secretRef: { name: app-secrets } + - configMapRef: { name: app-config } + env: + - { name: REDIS_URL, value: redis://redis-master.news.svc.cluster.local:6379/0 } + - { name: DATABASE_URL, value: postgresql://news:$(PG_PASSWORD)@pg-postgresql.news.svc.cluster.local:5432/news } + resources: + requests: { cpu: "200m", memory: "512Mi" } + limits: { cpu: "600m", memory: "1Gi" } +--- +apiVersion: v1 +kind: Service +metadata: { name: api, namespace: news } +spec: + selector: { app: api } + ports: + - name: http + port: 80 + targetPort: 8080 +``` + +**FastAPI search (sketch)** + +```python +# apps/api/search.py +from pgvector.psycopg import register_vector +import psycopg, numpy as np + +EMBED_DIM = 1536 + +def hybrid_search(conn, q, k=8): + with conn.cursor() as cur: + # 1) Embedding + v = embed(q) # call OpenAI embeddings + # 2) Keyword recall + cur.execute(""" + SELECT id, title, snippet, canonical_url, + ts_rank(to_tsvector('english', coalesce(title,'')||' '||coalesce(snippet,'')), plainto_tsquery(%s)) AS rank + FROM articles + WHERE to_tsvector('english', coalesce(title,'')||' '||coalesce(snippet,'')) @@ plainto_tsquery(%s) + ORDER BY rank DESC + LIMIT 50 + """, (q, q)) + rows = cur.fetchall() + ids = [r[0] for r in rows] or [-1] + # 3) Vector rerank + cur.execute(""" + SELECT a.id, a.title, a.snippet, a.canonical_url, + 1 - (e.embedding <=> %s::vector) AS sim + FROM articles a + JOIN article_embeddings e ON e.article_id = a.id + WHERE a.id = ANY(%s) + ORDER BY sim DESC LIMIT %s + """, (np.array(v), ids, k)) + return cur.fetchall() +``` + +### 7) Web UI (Next.js 14) + +* App Router, Tailwind, shadcn/ui. Server actions call API. +* Components: `Composer`, `AnswerBox` (with sentence-level citations), `ResultCard`, `SourceChip`. +* Add **PWA** manifest + basic offline cache for shell. + +### 8) Ingress (nginx primary, HAProxy optional) + +```yaml +# deploy/overlays/pi-prod/ingress-nginx.yaml +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: news + namespace: news + annotations: + kubernetes.io/ingress.class: nginx + nginx.ingress.kubernetes.io/proxy-body-size: "1m" +spec: + tls: + - hosts: [news.local] + secretName: news-tls + rules: + - host: news.local + http: + paths: + - path: / + pathType: Prefix + backend: { service: { name: web, port: { number: 80 } } } + - path: /v1 + pathType: Prefix + backend: { service: { name: api, port: { number: 80 } } } +``` + +### 9) Observability + +* **Logging**: JSON logs via `structlog` (API/workers), `stdout` aggregated by k0s. +* **Metrics**: Prometheus scraping (use `prometheus-fastapi-instrumentator`), Grafana dashboards. 
* **Tracing**: OpenTelemetry SDK exporting to Tempo/OTLP (optional).
* SLOs: p95 search < 600ms (warm); ingest freshness p95 < 5 min.

### 10) CI/CD (GitHub Actions)

* Build multi-arch images with `docker/setup-buildx-action`, push to your registry.
* Deploy via `kubectl` or ArgoCD (optional). Gate with manual approval.

### 11) Prompts & safety rails

* **Summary prompt**: 60–90 words, neutral tone, forbid speculation, 1–2 citations with URLs.
* **Answer prompt**: Use only retrieved snippets; claims in every sentence must cite `[n]`. If insufficient evidence, say so.
* **Guardrails**: Max 6 articles per answer; truncate inputs to token budget.

### 12) Performance knobs (Raspberry Pi friendly)

* Enable HTTP caching (ETag/If‑Modified‑Since).
* Redis cache TTL 10m for hot queries.
* Per‑host concurrency: 2 (scraper); global QPS: 0.5–1 for Reuters.
* Use gzip/deflate when fetching; strip images when scraping.

### 13) Data retention

* Keep `articles` 30 days rolling (configurable). Older rows archived to `articles_archive` without embeddings.

### 14) Security

* NetworkPolicies: only API/worker → DB/Redis; web → API; deny egress by default except OpenAI domains.
* Secrets from Kubernetes; rotate quarterly. Read‑only service accounts for web.
* TLS everywhere once the service is exposed beyond the LAN; CSP headers on web.

## Milestones

**MVP timeline: 2 weeks (LAN only, no TLS)**

### Week 1 — Foundations & ingest

* **Day 1–2**: Cluster prep (k0s), namespaces, nginx Ingress (HTTP only), metrics‑server. Registry access + buildx pipeline.
* **Day 3**: Postgres (pgvector) + Redis live; migrations applied.
* **Day 4**: Workers scaffolded (poll, scrape) with Reuters RSS + sitemap pollers; ETag/Last‑Modified implemented; robots policy set to *honor*.
* **Day 5**: Normalizer/dedupe; article schema writes; minimal admin page to view ingest logs.

**Exit criteria**: Reuters articles flowing into DB with title/snippet/category/published\_at; p95 freshness under 10 min.

### Week 2 — Search, summaries, UI polish

* **Day 6**: Embeddings worker + index (pgvector ivfflat). Hybrid search in API.
* **Day 7**: Summarizer worker; store 60–90 word summaries; cache.
* **Day 8**: Next.js UI (composer, answer box, cards, source chips). Basic keyboard nav.
* **Day 9**: Observability: Prometheus scrape + Grafana dashboard; SLOs wired.
* **Day 10**: Hardening (quotas, retries), data retention job; smoke tests; cut **MVP v0.1.0**.

**Exit criteria**: Query returns an answer with citations in < 800ms warm path; summaries stable; LAN users can search and read cited sources.

## Gathering Results

### KPIs (Primary)

* **Freshness (p95)**: time from article publication → available in search. Target: ≤ 5 minutes; stretch ≤ 2 minutes.
* **Answer Accuracy**: % of answer sentences that have at least one valid citation to the retrieved set. Target: ≥ 95%.

### KPIs (Secondary)

* **Coverage**: % of Reuters articles discovered vs. listed in sitemaps over last 24h. Target: ≥ 98%.
* **Latency (p95)**: query → first contentful paint (UI) and API response time. Targets: API ≤ 600ms warm; UI FCP ≤ 1.5s on LAN.
* **Stability**: worker error rate < 1%; scraper retry rate < 10%.
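
To make the freshness and citation‑coverage targets directly observable, a minimal sketch of how the workers and API could record them with `prometheus_client`; the metric names follow the Instrumentation list below, while the helper functions are illustrative:

```python
# Hypothetical helpers for recording two of the KPI metrics (prometheus_client assumed).
from datetime import datetime, timezone
from prometheus_client import Gauge, Histogram

INGEST_FRESHNESS = Histogram(
    "ingest_freshness_seconds",
    "Delay between article publication and availability in search",
    ["source"],
    buckets=(30, 60, 120, 300, 600, 1200),  # the 2-5 min target sits in the middle buckets
)
CITATION_COVERAGE = Gauge(
    "answer_citation_coverage_ratio",
    "Share of answer sentences carrying at least one valid citation",
)

def record_article_indexed(source: str, published_at: datetime) -> None:
    # Called by the normalizer once an article row is committed (published_at is timezone-aware).
    delay = (datetime.now(timezone.utc) - published_at).total_seconds()
    INGEST_FRESHNESS.labels(source=source).observe(max(delay, 0.0))

def record_answer_coverage(cited_sentences: int, total_sentences: int) -> None:
    # Called by the /v1/ask handler after the automatic citation check.
    if total_sentences:
        CITATION_COVERAGE.set(cited_sentences / total_sentences)
```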
+ +### Instrumentation + +* **Prometheus metrics** + + * `ingest_freshness_seconds{source=…}` (histogram) + * `ingest_discovered_total{kind= rss|sitemap|scrape}` + * `scrape_http_status_total{code=…}` + * `search_latency_seconds` (histogram) + * `answer_citation_coverage_ratio` (gauge) + * `worker_queue_depth{queue=…}` +* **Structured logs** (JSON): include `trace_id`, `job_id`, and normalized URL. +* **Dashboards (Grafana)**: Freshness, Search Latency, Coverage vs Sitemap, Error budget burn. + +### Accuracy evaluation + +* **Automatic**: + + * Parse answer into sentences; verify each sentence has at least one citation. + * Check that citation URLs match the top‑k retrieved set and that snippets contain supporting tokens (simple ROUGE‑like overlap). + * Flag low‑evidence sentences for review. +* **Human review** (1–2×/week): + + * 50 sampled answers; label: correct / partially supported / unsupported / off‑topic. + * Compute **hallucination rate** (unsupported sentences ÷ total) and track trend. + +### Feedback loop + +* UI **thumbs up/down** with optional comment saved to `feedback` table: + +```sql +CREATE TABLE feedback ( + id BIGSERIAL PRIMARY KEY, + query TEXT NOT NULL, + answer_id TEXT, + verdict TEXT CHECK (verdict IN ('up','down')), + comment TEXT, + created_at TIMESTAMPTZ NOT NULL DEFAULT now() +); +``` + +* Downvotes auto‑create a JIRA/GitHub issue if `answer_citation_coverage_ratio < 0.9`. + +### Experimentation + +* **Prompt variants** A/B via header flag in API (e.g., `x-prompt=v2`). +* **Ranking tweaks**: switch BM25 weight vs vector weight; record NDCG\@10 on labeled queries. + +### Post‑mortems & safety + +* Blameless post‑mortem for any incident where hallucination rate > 10% in a day or freshness p95 > 10 min for >1h. +* Daily data retention job verified; no full‑text persists beyond in‑memory summary context. + diff --git a/README.md b/README.md new file mode 100644 index 0000000..e7aaf09 --- /dev/null +++ b/README.md @@ -0,0 +1,3 @@ +# Classy Perplexity-style News Aggregator + +This repository houses the scaffolding for a Perplexity-inspired Reuters news aggregator designed for Raspberry Pi 5 clusters. See `INSTRUCTIONS.md` for the full specification and implementation guidelines.