# SPEC-1 – Classy Perplexity-style News Aggregator (Raspberry Pi 5 K8s)

## Background

You want a Perplexity-style web app that aggregates news from a defined pool of reference websites and presents results in a classy, attractive, highly responsive UI. The target runtime is a Raspberry Pi 5 Kubernetes cluster, so the system must be lightweight, ARM64-friendly, and resilient to node churn or SD-card fragility. The product should feel like a modern AI assistant for news discovery: fast search, crisp summaries, clear source attributions, and mobile-first ergonomics.

Initial working assumptions (to be confirmed):

* Content sources are a curated list of reputable outlets and blogs that permit aggregation with proper linking and snippet-length quoting.
* We will index headlines, metadata, and short excerpts; full-text storage will be minimized or avoided unless licensed.
* The app will support semantic search + conversational Q\&A over the indexed corpus, with citations to original articles.
* Real-time(ish) freshness target: new articles discoverable within 2–5 minutes of publication.
* UI aims to echo Perplexity’s clean card layout, with source badges, inline citations, and a composer panel for queries.
* Deployment must fit on 2–4 ARM64 nodes, using lightweight containers and a small replicated datastore.

## Requirements

**Scope for MVP**: Start with **Reuters** as the single source. Use official **RSS/Atom feeds and daily sitemaps** when available; gracefully fall back to HTML scraping for sections without feeds, storing only metadata/snippets with links. Freshness target 2–5 minutes. UI mirrors Perplexity’s card+chat layout with inline citations.

### MoSCoW

**Must-have**

* Aggregate from Reuters via RSS/Atom + sitemaps; fallback HTML scraper with robots.txt compliance toggle.
* ARM64-ready containers deployable on Raspberry Pi 5 K8s (k0s in this spec; k3s or MicroK8s are alternatives).
* Ingest pipeline with deduplication, canonical URL normalization, and rate-limit/backoff.
* Index headlines, authors, timestamps, topics, short excerpt (<= 320 chars), and source URL.
* Full-text search over stored fields; semantic search embeddings over titles+snippets.
* Summarization and on-page Q\&A with **clear citations** to source URLs.
* Classy, responsive UI with Perplexity-style query composer, results cards, and source badges.
* Observability: structured logs, basic metrics (ingest latency, queue depth, p95 response time), and alerting.
* Legal safety rails: configurable snippet length, per-domain robots policy, and kill-switch per source.

**Should-have**

* Topic taxonomy and tags (World, Business, Tech, etc.).
* Incremental sitemap polling (by date) + change-list RSS polling with jitter to avoid burst load.
* Reader mode extraction (readability-style) used **only for summarization** in memory, not stored.
* Caching layer (HTTP + summary cache) to keep Raspberry Pi costs low.
* Multi-node HA for index and queue; rolling updates.

**Could-have**

* User accounts for saved searches and daily digests.
* Multi-source expansion via declarative YAML for new sites.
* Related-story clustering and timeline views.
* Basic mobile PWA installability and offline read-later for snippets.

**Won’t-have (MVP)**

* Paywalled content bypassing or full-text storage of copyrighted articles.
* Personalized recommendations or email digests.
* Editorial curation tooling beyond tags and pinning.

## Method

### High-level architecture

```plantuml
@startuml
skinparam componentStyle rectangle
skinparam shadowing false
skinparam ArrowColor #888
skinparam DefaultFontName Inter

rectangle "k0s Cluster (ARM64 Raspberry Pi 5)" as K8S {
  node "Namespace: news" as NS {
    [Ingest Scheduler\n(CronJobs)] as Scheduler
    [Feed+Sitemap Poller\n(FastAPI Worker)] as Poller
    [HTML Scraper\n(Worker, Trafilatura)] as Scraper
    [Normalizer/Dedupe\n(Worker)] as Normalizer
    [Embedder\n(Worker, calls OpenAI/Gemini embeddings)] as Embedder
    [Summarizer\n(Worker, calls OpenAI gpt-4o-mini/Gemini pro)] as Summarizer

    database "PostgreSQL + pgvector" as PG
    [Redis\n(Cache + Queue)] as Redis

    [API Gateway\n(FastAPI)] as API
    [Web UI\n(Next.js, Tailwind, shadcn)] as Web
  }
}

Poller --> Scraper
Scraper --> Normalizer
Normalizer --> PG
Embedder --> PG
Summarizer --> PG

Scheduler --> Poller
Embedder --> [OpenAI Embeddings API/Gemini API]
Summarizer --> [OpenAI Chat Completions/Gemini API]

API --> PG
API --> Redis
Web --> API
@enduml
```

**Why these choices (MVP):**

* **Source**: Start with **Reuters** using news sitemaps (with pagination parameters) and RSS; where feeds don’t exist, scrape respectfully with robots awareness.
* **Storage**: **PostgreSQL + pgvector** keeps the stack compact (one DB for metadata, text search, and vectors). Postgres full-text covers keyword search; pgvector powers semantic search.
* **Workers**: Python **FastAPI** workers using **Trafilatura** for robust article extraction and metadata parsing. **Redis** as the lightweight queue/cache (Dramatiq or RQ).
* **Summaries/Q\&A**: On-demand summaries and answer synthesis via **gpt-4o-mini or Gemini pro** with **inline citations**. Embeddings via **text-embedding-3-small** or a dedicated Gemini embedding model (Gemini Flash is a generative model and does not return embeddings). Both providers are accessed through API keys stored as Kubernetes Secrets.
* **UI**: **Next.js 14 App Router**, Tailwind + shadcn for a Perplexity-style, low-latency interface.
* **k0s**: ARM64-friendly. Use **nginx-ingress** for HTTP routing, with optional **HAProxy Ingress** for TCP/advanced policies.

### Data model (PostgreSQL)

```sql
-- Sources (static for MVP)
CREATE TABLE sources (
  id           SERIAL PRIMARY KEY,
  name         TEXT NOT NULL UNIQUE,   -- e.g., 'Reuters'
  base_url     TEXT NOT NULL,          -- e.g., https://www.reuters.com
  rss_urls     TEXT[] NOT NULL DEFAULT '{}',
  sitemap_urls TEXT[] NOT NULL DEFAULT '{}',
  robots_txt   TEXT,
  enabled      BOOLEAN NOT NULL DEFAULT true
);

-- Raw fetch jobs (observability + retries)
CREATE TABLE fetch_jobs (
  id            BIGSERIAL PRIMARY KEY,
  source_id     INT REFERENCES sources(id),
  url           TEXT NOT NULL,
  kind          TEXT NOT NULL CHECK (kind IN ('rss','sitemap','article')),
  status        TEXT NOT NULL CHECK (status IN ('queued','fetched','parsed','failed')),
  http_status   INT,
  etag          TEXT,
  last_modified TIMESTAMPTZ,
  attempts      INT NOT NULL DEFAULT 0,
  error         TEXT,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON fetch_jobs (status, created_at);

-- Canonical articles (no copyrighted full text stored)
CREATE TABLE articles (
  id            BIGSERIAL PRIMARY KEY,
  source_id     INT REFERENCES sources(id) NOT NULL,
  canonical_url TEXT NOT NULL,
  url_hash      BYTEA NOT NULL,        -- SHA-256 of canonical_url
  title         TEXT NOT NULL,
  author        TEXT,
  category      TEXT,                  -- World, Business, Tech, etc.
  published_at  TIMESTAMPTZ,
  fetched_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  snippet       TEXT,                  -- <= 320 chars, from feed/lede
  summary       TEXT,                  -- model-generated abstract
  image_url     TEXT,
  language      TEXT DEFAULT 'en',
  UNIQUE (source_id, url_hash)
);
CREATE INDEX ON articles (published_at DESC);
CREATE INDEX ON articles USING GIN (to_tsvector('english', coalesce(title,'') || ' ' || coalesce(snippet,'')));

-- Embeddings for semantic search (title+snippet)
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE article_embeddings (
  article_id BIGINT PRIMARY KEY REFERENCES articles(id) ON DELETE CASCADE,
  -- 1536 matches text-embedding-3-small; if a Gemini embedding model is used
  -- instead, declare the column with that model's output dimension.
  embedding  vector(1536)
);
-- ivfflat clusters existing rows, so build (or rebuild) this index after the
-- initial load; `lists` is a tuning knob.
CREATE INDEX ON article_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Tags and mapping (optional but handy)
CREATE TABLE tags (
  id   SERIAL PRIMARY KEY,
  name TEXT UNIQUE
);
CREATE TABLE article_tags (
  article_id BIGINT REFERENCES articles(id) ON DELETE CASCADE,
  tag_id     INT REFERENCES tags(id) ON DELETE CASCADE,
  PRIMARY KEY (article_id, tag_id)
);
```

### Ingestion flow

1. **Discovery**

   * Poll **RSS/Atom** endpoints with ETag/Last-Modified to minimize bandwidth.
   * Poll **news sitemaps** using incremental parameters (e.g., `from=` offsets when supported). Maintain per-endpoint cursors.
   * For sections without feeds, enqueue **HTML pages** discovered from site index pages (rate-limited) and respect `robots.txt` (configurable).

2. **Fetch & Extract**

   * HTTP client with retry + exponential backoff and per-host concurrency caps (e.g., 2–4). Respect `Cache-Control` where present.
   * Use **Trafilatura** with `favor_precision=True` to extract main content for **in-memory summarization only**; do not persist full text.
   * Generate a **canonical URL** (resolve redirects, strip tracking params) and compute `url_hash`.

3. **Normalize & Deduplicate**

   * If `(source_id, url_hash)` exists, skip insert; else create `articles` row with metadata and **snippet** (<=320 chars).
   * Classify category using rule-based hints (URL path, RSS category) with a fallback lightweight classifier.

4. **Summaries & Embeddings**

   * Create a short **summary** (60–90 words, neutral tone) with inline citation marker `[1]` → canonical URL.
   * Compute **embedding** on `(title + "\n" + snippet)` and upsert into `article_embeddings`.

5. **Indexing & Cache**

   * Postgres GIN index supports keyword search; pgvector handles ANN semantic search.
   * Cache hot queries and summaries in Redis for 5–15 minutes.

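
The normalization step in the flow above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual normalizer: the function names, the list of tracking parameters, and the trailing-slash rule are assumptions, and redirect resolution (which needs a network call) is left out.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters; extend per source as needed.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def canonicalize_url(url: str) -> str:
    """Lowercase scheme/host, drop fragment and tracking params, sort the query."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    query.sort()
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def url_hash(canonical_url: str) -> bytes:
    """SHA-256 digest, suitable for the BYTEA `url_hash` column."""
    return hashlib.sha256(canonical_url.encode("utf-8")).digest()


def make_snippet(text: str, max_chars: int = 320) -> str:
    """Collapse whitespace and cut at a word boundary within max_chars."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    cut = text[: max_chars - 1].rsplit(" ", 1)[0]
    return cut + "…"
```

Two URLs that differ only in tracking parameters, fragment, or host casing then map to the same `(source_id, url_hash)` key, which is what the dedupe check relies on.
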
### API design (FastAPI)

* `GET /v1/search?q=&mode=hybrid&page=`: Hybrid search (keyword + vector rerank), returns cards with title, snippet, badges, and citations.
* `GET /v1/articles/{id}`: Metadata + summary.
* `POST /v1/ask`: Conversational answer over top-k retrieved articles, always with citations.
* `POST /v1/feedback`: Thumbs up/down and optional comment.

### UI flows (Next.js 14)

* **Home**: Center composer, query suggestions, trending topics.
* **Results**: Perplexity-style answer at top with source chips; below, cards for each cited article; sticky composer for follow-ups.
* **Interactions**: Cmd/Ctrl-K global search, `?` keyboard help, skeleton loaders, optimistic UI.

### Kubernetes (k0s) deployment sketch

* **Namespaces**: `news`, `news-observe`.
* **Ingress**: `nginx-ingress` for HTTPS; optional parallel **HAProxy Ingress** for TCP/advanced use. Certs via cert-manager + DNS-01 or HTTP-01.
* **Deployments** (ARM64 images):

  * `api` (FastAPI on Uvicorn, optionally managed by Gunicorn): 2 replicas, HPA on CPU 60% & p95 latency SLI.
  * `web` (Next.js): 2 replicas, static export (optional) behind Node adapter.
  * `worker` (ingest/summarize/embed): 2–4 replicas, separate queues for `poll`, `scrape`, `summ`, `embed`.
  * `postgres` (Bitnami ARM64) with persistent volume; enable `pgvector` extension.
  * `redis` (Bitnami ARM64) for cache/queue.
* **RBAC/Secrets**: Kubernetes Secrets for API keys; service accounts per deployment.
* **Resources (starting, CPU/memory requests)**: api 200m/512Mi; web 100m/256Mi; worker 300m/1Gi; redis 50m/256Mi; postgres 250m/2Gi.
* **Autoscaling**: HPA + VPA recommendations; cluster metrics via metrics-server.

### Ranking & answer synthesis

* **Hybrid search**: keyword recall with Postgres full-text ranking (`ts_rank`, a BM25-like stand-in rather than true BM25) → take top 50; compute cosine similarity on vectors → rerank → top 8.
* **Answer**: Prompt model with the top 6 snippets + titles and URLs; enforce **citation after each sentence** where evidence exists. Refuse to answer beyond source material.

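
The rerank stage described above can be illustrated in plain Python. This is a sketch under the assumption that keyword recall has already produced `(article_id, embedding)` pairs; in the deployed system the same ordering is meant to come from pgvector's cosine distance operator, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(
    query_vec: Sequence[float],
    candidates: List[Tuple[int, Sequence[float]]],
    k: int = 8,
) -> List[Tuple[int, float]]:
    """Order keyword-recall candidates by similarity to the query; keep the top k."""
    scored = [(aid, cosine_similarity(query_vec, emb)) for aid, emb in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because the rerank only sees what keyword recall returned, an article that matches semantically but shares no keywords with the query never reaches this stage; that is the trade-off of the recall-then-rerank design.
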
### Rate limiting & ethics

* Per-source QPS caps (e.g., 0.5–1 rps) and adaptive backoff.
* Honor robots.txt by default; switchable per your policy. Always link prominently to original.
* Snippets limited; no storage of full article text.

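
A minimal sketch of the two rules above, using only the standard library. The helper names, the `honor_robots` toggle, and the backoff constants are illustrative assumptions; "full jitter" is one common backoff variant, not necessarily the one the workers will use.

```python
import random
from typing import Callable
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str, honor_robots: bool = True) -> bool:
    """Check a URL against an already-fetched robots.txt body.

    `honor_robots` models the per-source policy toggle; when False the check is skipped.
    """
    if not honor_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))
```

Passing the random source as a parameter keeps the delay deterministic in tests while leaving real jitter in production.
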
## Implementation

### 0) Repo layout

```
news-agg/
  apps/
    api/                 # FastAPI (Python 3.11)
    web/                 # Next.js 14 UI
    workers/             # poll/scrape/summarize/embed (FastAPI tasks + RQ/Dramatiq)
  deploy/
    base/                # K8s Kustomize base (namespaces, RBAC, NetworkPolicies)
    overlays/
      pi-prod/
        kustomization.yaml
        postgres.yaml
        redis.yaml
        api.yaml
        web.yaml
        workers.yaml
        cron-poller.yaml
        ingress-nginx.yaml
        ingress-haproxy.yaml (optional)
        secrets.example.yaml
  ops/
    helm-values/
      bitnami-postgresql.yaml
      bitnami-redis.yaml
    scripts/
      build.sh           # multi-arch docker buildx
      db_migrate.sql     # tables + pgvector
```

### 1) Container images (ARM64)

* **Python base**: `python:3.11-slim` + `uv`/`pip-tools`; compile wheels at build time.
* **Node**: `node:18-alpine` → `next build` then run with `node` or export static.
* Use **`docker buildx`** to produce `linux/arm64` images. Example:

```
docker buildx build --platform linux/arm64 -t registry/pi/news-api:0.1 -f apps/api/Dockerfile --push .
```

**apps/api/Dockerfile** (snippet)

```Dockerfile
FROM python:3.11-slim
# Build tools and libpq headers for packages that compile native extensions
RUN apt-get update && apt-get install -y --no-install-recommends build-essential libpq-dev && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY apps/api/pyproject.toml apps/api/uv.lock ./
RUN pip install -U pip && pip install uv
# Install the dependencies declared in pyproject.toml; let the build fail if this step fails
RUN uv pip install --system -r pyproject.toml
COPY apps/api/ .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```

### 2) k0s cluster prep (once)

* Install **nginx-ingress** and (optionally) **HAProxy Ingress** via manifests/Helm.
* Install **cert-manager** for TLS if exposing publicly.
* Add **metrics-server** for HPA and **KEDA** (optional) for queue-based scaling.

### 3) Datastores

**PostgreSQL (Bitnami, pgvector)**

```yaml
# deploy/overlays/pi-prod/postgres.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: pgdata, namespace: news }
spec:
  accessModes: ["ReadWriteOnce"]
  resources: { requests: { storage: 20Gi } }
---
apiVersion: v1
kind: ConfigMap
metadata: { name: pg-init, namespace: news }
data:
  00-init.sql: |
    CREATE EXTENSION IF NOT EXISTS vector;
    -- migrations applied by apps on startup too
---
# NOTE: HelmChart (helm.cattle.io/v1) is the CRD served by the k3s helm-controller.
# On k0s, install the same chart through the Helm extensions in the k0s config
# or with a plain `helm install`, reusing the values below.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata: { name: pg, namespace: kube-system }
spec:
  chart: oci://registry-1.docker.io/bitnamicharts/postgresql
  targetNamespace: news
  version: 15.x.x
  valuesContent: |
    image:
      repository: bitnami/postgresql
      tag: 15-debian-12
    primary:
      extraVolumes:
        - name: pg-init
          configMap: { name: pg-init }
      extraVolumeMounts:
        - name: pg-init
          mountPath: /docker-entrypoint-initdb.d
      persistence:
        existingClaim: pgdata
    auth:
      username: news
      # Placeholder: substitute before applying; it is not expanded automatically
      password: ${PG_PASSWORD}
      database: news
```

**Redis (Bitnami)**

```yaml
# deploy/overlays/pi-prod/redis.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata: { name: redis, namespace: kube-system }
spec:
  chart: oci://registry-1.docker.io/bitnamicharts/redis
  targetNamespace: news
  version: 18.x.x
  valuesContent: |
    architecture: standalone
    auth:
      enabled: false
```

### 4) Secrets & Config

```yaml
# deploy/overlays/pi-prod/secrets.example.yaml (copy to secrets.yaml and fill)
apiVersion: v1
kind: Secret
metadata: { name: app-secrets, namespace: news }
type: Opaque
data:
  OPENAI_API_KEY: <base64>
  GEMINI_API_KEY: <base64>
  APP_SIGNING_KEY: <base64>
  PG_PASSWORD: <base64>   # referenced as $(PG_PASSWORD) in DATABASE_URL
---
apiVersion: v1
kind: ConfigMap
metadata: { name: app-config, namespace: news }
data:
  SNIPPET_MAX: "320"
  # Verify that the feed and sitemap URLs below are live before relying on them
  SOURCES: |
    - name: Reuters
      base_url: https://www.reuters.com
      rss:
        - https://www.reuters.com/rss/worldNews
      sitemaps:
        - https://www.reuters.com/sitemap_news.xml
      robots_policy: honor
  RANKING: "hybrid"
```

### 5) Workers (poll, scrape, summarize, embed)

```yaml
# deploy/overlays/pi-prod/workers.yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: workers, namespace: news }
spec:
  replicas: 3
  selector: { matchLabels: { app: workers } }
  template:
    metadata: { labels: { app: workers } }
    spec:
      containers:
        - name: workers
          image: registry/pi/news-workers:0.1
          envFrom:
            - secretRef: { name: app-secrets }
            - configMapRef: { name: app-config }
          env:
            - { name: REDIS_URL, value: "redis://redis-master.news.svc.cluster.local:6379/0" }
            # $(PG_PASSWORD) only expands if PG_PASSWORD is defined for this container
            - { name: DATABASE_URL, value: "postgresql://news:$(PG_PASSWORD)@pg-postgresql.news.svc.cluster.local:5432/news" }
          resources:
            requests: { cpu: "300m", memory: "1Gi" }
            limits: { cpu: "900m", memory: "2Gi" }
          livenessProbe: { httpGet: { path: /healthz, port: 8080 }, initialDelaySeconds: 15 }
          readinessProbe: { httpGet: { path: /readyz, port: 8080 }, initialDelaySeconds: 5 }
```

**Cron: feed/sitemap polling**

```yaml
apiVersion: batch/v1
kind: CronJob
metadata: { name: poller, namespace: news }
spec:
  schedule: "*/2 * * * *" # every 2 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: poll
              image: registry/pi/news-workers:0.1
              args: ["poll"]
              envFrom:
                - secretRef: { name: app-secrets }
                - configMapRef: { name: app-config }
```

### 6) API service (FastAPI)

```yaml
# deploy/overlays/pi-prod/api.yaml
apiVersion: apps/v1
kind: Deployment
metadata: { name: api, namespace: news }
spec:
  replicas: 2
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: registry/pi/news-api:0.1
          ports: [{ containerPort: 8080 }]
          envFrom:
            - secretRef: { name: app-secrets }
            - configMapRef: { name: app-config }
          env:
            - { name: REDIS_URL, value: "redis://redis-master.news.svc.cluster.local:6379/0" }
            - { name: DATABASE_URL, value: "postgresql://news:$(PG_PASSWORD)@pg-postgresql.news.svc.cluster.local:5432/news" }
          resources:
            requests: { cpu: "200m", memory: "512Mi" }
            limits: { cpu: "600m", memory: "1Gi" }
---
apiVersion: v1
kind: Service
metadata: { name: api, namespace: news }
spec:
  selector: { app: api }
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

**FastAPI search (sketch)**

```python
# apps/api/search.py
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

EMBED_DIM = 1536  # must match the vector(...) column in article_embeddings


def embed(text: str) -> list[float]:
    """Placeholder: call the configured embeddings API and return EMBED_DIM floats."""
    raise NotImplementedError


def hybrid_search(conn: psycopg.Connection, q: str, k: int = 8):
    # Let psycopg adapt numpy arrays to pgvector's `vector` type.
    # In a real app, do this once when the connection is created.
    register_vector(conn)
    with conn.cursor() as cur:
        # 1) Embedding
        v = np.asarray(embed(q), dtype=np.float32)
        # 2) Keyword recall (same 'english' config as the GIN index expression)
        cur.execute("""
            SELECT id, title, snippet, canonical_url,
                   ts_rank(to_tsvector('english', coalesce(title,'')||' '||coalesce(snippet,'')),
                           plainto_tsquery('english', %s)) AS rank
            FROM articles
            WHERE to_tsvector('english', coalesce(title,'')||' '||coalesce(snippet,''))
                  @@ plainto_tsquery('english', %s)
            ORDER BY rank DESC
            LIMIT 50
        """, (q, q))
        ids = [r[0] for r in cur.fetchall()]
        if not ids:
            return []  # no keyword hits, nothing to rerank
        # 3) Vector rerank (<=> is cosine distance, so 1 - distance = similarity)
        cur.execute("""
            SELECT a.id, a.title, a.snippet, a.canonical_url,
                   1 - (e.embedding <=> %s::vector) AS sim
            FROM articles a
            JOIN article_embeddings e ON e.article_id = a.id
            WHERE a.id = ANY(%s)
            ORDER BY sim DESC LIMIT %s
        """, (v, ids, k))
        return cur.fetchall()
```

### 7) Web UI (Next.js 14)

* App Router, Tailwind, shadcn/ui. Server actions call API.
* Components: `Composer`, `AnswerBox` (with sentence-level citations), `ResultCard`, `SourceChip`.
* Add **PWA** manifest + basic offline cache for shell.

### 8) Ingress (nginx primary, HAProxy optional)

```yaml
# deploy/overlays/pi-prod/ingress-nginx.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: news
  namespace: news
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "1m"
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts: [news.local]
      secretName: news-tls
  rules:
    - host: news.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: web, port: { number: 80 } } }
          - path: /v1
            pathType: Prefix
            backend: { service: { name: api, port: { number: 80 } } }
```

### 9) Observability

* **Logging**: JSON logs via `structlog` (API/workers), written to `stdout` and collected by the cluster's log pipeline.
* **Metrics**: Prometheus scraping (use `prometheus-fastapi-instrumentator`), Grafana dashboards.
* **Tracing**: OpenTelemetry SDK exporting to Tempo/OTLP (optional).
* **SLOs**: p95 search < 600ms (warm); ingest freshness p95 < 5 min.

### 10) CI/CD (GitHub Actions)

* Build multi-arch images with `setup-buildx-action`, push to your registry.
* Deploy via `kubectl` or ArgoCD (optional). Gate with manual approval.

### 11) Prompts & safety rails

* **Summary prompt**: 60–90 words, neutral tone, forbid speculation, 1–2 citations with URLs.
* **Answer prompt**: Use only retrieved snippets; every sentence-level claim must cite `[n]`. If there is insufficient evidence, say so.
* **Guardrails**: Max 6 articles per answer; truncate inputs to token budget.

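
One way the answer prompt and guardrails above could be assembled is sketched below. The `Retrieved` dataclass, the prompt wording, and the character budget (a crude stand-in for a real token count) are assumptions for illustration, not the project's actual prompt.

```python
from dataclasses import dataclass
from typing import List

MAX_ARTICLES = 6  # guardrail: at most 6 articles per answer


@dataclass
class Retrieved:
    title: str
    snippet: str
    url: str


def build_answer_prompt(question: str, articles: List[Retrieved], max_chars: int = 6000) -> str:
    """Number the sources [1]..[n] so the model can cite them per sentence."""
    sources = [
        f"[{n}] {art.title}\n{art.snippet}\nURL: {art.url}"
        for n, art in enumerate(articles[:MAX_ARTICLES], start=1)
    ]
    # Character truncation approximates the token budget; swap in a tokenizer if available.
    context = "\n\n".join(sources)[:max_chars]
    return (
        "Answer using only the numbered sources below. "
        "Cite every sentence-level claim with its source marker, e.g. [1]. "
        "If the sources are insufficient, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\n"
    )
```

Keeping the numbering stable between the prompt and the API response is what lets the UI map each `[n]` marker back to a source chip.
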
### Gemini LLM integration

As an alternative to OpenAI models, this project supports Google's Gemini models for summarization and conversational answers; embeddings require a separate Gemini embedding model (see the integration notes).

**Available models**

* **gemini-2.5-flash**: Lightweight model optimized for fast responses and high throughput
* **gemini-2.5-pro**: Advanced "thinking" model with enhanced reasoning capabilities

**Command usage**

Use the following Gemini CLI commands to try prompts against each model:

```bash
# For fast, lightweight responses (quick summaries)
gemini --model gemini-2.5-flash -p "<PROMPT>"

# For complex reasoning and detailed analysis (conversational answers)
gemini --model gemini-2.5-pro -p "<PROMPT>"
```

**Integration notes**

* Gemini models can fill the same roles as the OpenAI chat models behind the worker interface; prompts and response parsing may need per-provider adjustment.
* The Flash model suits quick, high-throughput summaries. It is a generative model and does not produce embeddings, so the embeddings worker needs a dedicated Gemini embedding model, and `article_embeddings.embedding` must be declared with that model's output dimension (`vector(1536)` matches `text-embedding-3-small`).
* The Pro model is recommended for the summarizer worker, in the role gpt-4o-mini plays on the OpenAI side.
* Configure via `GEMINI_API_KEY` in Kubernetes Secrets alongside `OPENAI_API_KEY`.
* Network policies should allow egress to generativelanguage.googleapis.com.

### 12) Performance knobs (Raspberry Pi friendly)

* Enable HTTP caching (ETag/If-Modified-Since).
* Redis cache TTL 10m for hot queries.
* Per-host concurrency: 2 (scraper); global QPS: 0.5–1 for Reuters.
* Use gzip/deflate when fetching; strip images when scraping.

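
The HTTP caching knob amounts to sending the validators stored in `fetch_jobs` back to the server. A small sketch, assuming the ETag is stored verbatim and `last_modified` is a timezone-aware datetime; the helper names are hypothetical and the actual HTTP client call is omitted.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(
    etag: Optional[str] = None,
    last_modified: Optional[datetime] = None,
) -> Dict[str, str]:
    """Build request headers for a conditional GET from stored validators."""
    headers = {"Accept-Encoding": "gzip, deflate"}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        # HTTP dates are expressed in GMT
        headers["If-Modified-Since"] = format_datetime(
            last_modified.astimezone(timezone.utc), usegmt=True
        )
    return headers


def needs_processing(status_code: int) -> bool:
    """304 Not Modified means the cached copy is current and parsing can be skipped."""
    return status_code != 304
```

On a 304 the poller can mark the job as fetched without downloading or parsing a body, which is where most of the bandwidth and CPU saving on the Pi comes from.
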
### 13) Data retention

* Keep `articles` 30 days rolling (configurable). Older rows archived to `articles_archive` without embeddings.

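### 14) Security

* NetworkPolicies: only API/worker → DB/Redis; web → API; deny egress by default except the OpenAI and Gemini domains (api.openai.com, generativelanguage.googleapis.com) and, for the ingest workers, the configured source sites (e.g., www.reuters.com). Standard NetworkPolicy rules match IP ranges rather than hostnames, so domain-based egress needs a CNI with FQDN policies or an egress proxy.
* Secrets from Kubernetes; rotate quarterly. Read-only service accounts for web. Include both `OPENAI_API_KEY` and `GEMINI_API_KEY` in secret management.
* TLS everywhere once the service is exposed beyond the LAN (the MVP milestone runs HTTP-only on the LAN); CSP headers on web.
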
## Milestones

**MVP timeline: 2 weeks (LAN only, no TLS)**

### Week 1: Foundations & ingest

* **Day 1–2**: Cluster prep (k0s), namespaces, nginx Ingress (HTTP only), metrics-server. Registry access + buildx pipeline.
* **Day 3**: Postgres (pgvector) + Redis live; migrations applied.
* **Day 4**: Workers scaffolded (poll, scrape) with Reuters RSS + sitemap pollers; ETag/Last-Modified implemented; robots policy set to *honor*.
* **Day 5**: Normalizer/dedupe; article schema writes; minimal admin page to view ingest logs.

**Exit criteria**: Reuters articles flowing into DB with title/snippet/category/`published_at`; p95 freshness under 10 min.

### Week 2: Search, summaries, UI polish

* **Day 6**: Embeddings worker + index (pgvector ivfflat). Hybrid search in API.
* **Day 7**: Summarizer worker; store 60–90 word summaries; cache.
* **Day 8**: Next.js UI (composer, answer box, cards, source chips). Basic keyboard nav.
* **Day 9**: Observability: Prometheus scrape + Grafana dashboard; SLOs wired.
* **Day 10**: Hardening (quotas, retries), data retention job; smoke tests; cut **MVP v0.1.0**.

**Exit criteria**: Query returns an answer with citations in < 800ms warm path; summaries stable; LAN users can search and read cited sources.

## Gathering Results

### KPIs (Primary)

* **Freshness (p95)**: time from article publication → available in search. Target: ≤ 5 minutes; stretch ≤ 2 minutes.
* **Answer Accuracy**: % of answer sentences that have at least one valid citation to the retrieved set. Target: ≥ 95%.

### KPIs (Secondary)

* **Coverage**: % of Reuters articles discovered vs. listed in sitemaps over last 24h. Target: ≥ 98%.
* **Latency (p95)**: query → first contentful paint (UI) and API response time. Targets: API ≤ 600ms warm; UI FCP ≤ 1.5s on LAN.
* **Stability**: worker error rate < 1%; scraper retry rate < 10%.

### Instrumentation

* **Prometheus metrics**

  * `ingest_freshness_seconds{source=…}` (histogram)
  * `ingest_discovered_total{kind=rss|sitemap|scrape}`
  * `scrape_http_status_total{code=…}`
  * `search_latency_seconds` (histogram)
  * `answer_citation_coverage_ratio` (gauge)
  * `worker_queue_depth{queue=…}`
* **Structured logs** (JSON): include `trace_id`, `job_id`, and normalized URL.
* **Dashboards (Grafana)**: Freshness, Search Latency, Coverage vs Sitemap, Error budget burn.

### Accuracy evaluation

* **Automatic**:

  * Parse answer into sentences; verify each sentence has at least one citation.
  * Check that citation URLs match the top-k retrieved set and that snippets contain supporting tokens (simple ROUGE-like overlap).
  * Flag low-evidence sentences for review.
* **Human review** (1–2×/week):

  * 50 sampled answers; label: correct / partially supported / unsupported / off-topic.
  * Compute **hallucination rate** (unsupported sentences ÷ total) and track trend.

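
The first automatic check can be sketched as follows. The sentence splitter is deliberately naive (abbreviations such as "U.S." will over-split), it assumes `[n]` markers sit before the sentence-ending punctuation, and it only verifies that a marker points into the retrieved set; the token-overlap check is not shown. Function names are illustrative.

```python
import re
from typing import List, Tuple

CITATION = re.compile(r"\[(\d+)\]")


def split_sentences(answer: str) -> List[str]:
    """Naive split on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def citation_coverage(answer: str, retrieved_count: int) -> Tuple[float, List[str]]:
    """Return (share of sentences with a valid citation, sentences lacking one).

    A citation [n] counts as valid when 1 <= n <= retrieved_count.
    """
    sentences = split_sentences(answer)
    if not sentences:
        return 0.0, []
    uncited = []
    for sentence in sentences:
        refs = [int(n) for n in CITATION.findall(sentence)]
        if not any(1 <= n <= retrieved_count for n in refs):
            uncited.append(sentence)
    covered = len(sentences) - len(uncited)
    return covered / len(sentences), uncited
```

The ratio is a natural input for the `answer_citation_coverage_ratio` gauge, and the uncited list gives reviewers the exact sentences to inspect.
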
### Feedback loop

* UI **thumbs up/down** with optional comment saved to `feedback` table:

```sql
CREATE TABLE feedback (
  id         BIGSERIAL PRIMARY KEY,
  query      TEXT NOT NULL,
  answer_id  TEXT,
  verdict    TEXT CHECK (verdict IN ('up','down')),
  comment    TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

* Downvotes auto-create a JIRA/GitHub issue if `answer_citation_coverage_ratio < 0.9`.

### Experimentation

* **Prompt variants** A/B via header flag in API (e.g., `x-prompt=v2`).
* **Ranking tweaks**: vary the keyword-rank weight vs. the vector-similarity weight; record NDCG\@10 on labeled queries.

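
For reference, NDCG@10 on a labeled query can be computed as below. This sketch uses linear gain and derives the ideal ordering from the labels of the returned list only; a stricter evaluation would build the ideal ranking from every labeled document for the query. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (position i discounted by log2(i + 2))."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the same labels sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the human labels of the returned results in ranked order, so a perfectly ordered list scores 1.0 and any swap that pushes a more relevant article down lowers the score.
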
### Post-mortems & safety

* Blameless post-mortem for any incident where hallucination rate > 10% in a day or freshness p95 > 10 min for >1h.
* Daily data retention job verified; no full-text persists beyond in-memory summary context.