# Pathrule Pattern: RAG & Embeddings (1.0.0)
# ::pathrule:package:rag-embeddings

### [RULE] Chunk on semantic boundaries with overlap; never embed whole documents  (path: /src/rag)
<!-- scope: folder | priority: high | advisory -->

An embedding is one point in space. Embed a whole document and that point is the average of every topic it covers, so it matches everything weakly and nothing well.

- Split into chunks bounded by a token budget (roughly 256-512 tokens for prose; tune to your content) and align splits to natural boundaries: headings, paragraphs, code blocks, sentences. Do not split mid-sentence or mid-code-block.
- Keep a small overlap (about 10-20%) between adjacent chunks so a fact that straddles a boundary survives in at least one chunk.
- Store provenance on every chunk: source id, title, section/heading, and position. This metadata is what makes filtering and citation possible later.
- Match chunk size to the retrieval job: smaller chunks for precise fact lookup, larger for narrative context. One global size rarely fits every document type.
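
The chunking bullets above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `chunk_sentences` and `approx_tokens` are hypothetical names, the word-count token estimate and the regex sentence splitter are stand-ins for a real tokenizer and boundary detector, and only `source_id` and `position` are attached as provenance.

```python
import re


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): count whitespace-separated words."""
    return len(text.split())


def chunk_sentences(text, source_id, max_tokens=400, overlap_ratio=0.15):
    """Pack whole sentences into chunks; carry trailing sentences forward as overlap."""
    # Naive sentence boundary: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    overlap_budget = int(max_tokens * overlap_ratio)
    groups, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and approx_tokens(" ".join(candidate)) > max_tokens:
            groups.append(current)
            # Re-use the last sentences of the closed chunk, up to the overlap budget.
            carried = []
            for prev in reversed(current):
                if approx_tokens(" ".join([prev] + carried)) > overlap_budget:
                    break
                carried.insert(0, prev)
            current = carried + [sentence]
        else:
            current = candidate
    if current:
        groups.append(current)
    # Provenance travels with every chunk; real rows would also carry title and section.
    return [
        {"source_id": source_id, "position": i, "text": " ".join(group)}
        for i, group in enumerate(groups)
    ]
```

Splits only ever fall between sentences, so no chunk ends mid-sentence. A single sentence longer than `max_tokens` is kept whole as an oversize chunk, and overlap plus the next sentence can slightly exceed the budget; a production chunker would need to handle both cases.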

---

### [RULE] Filter retrieval by metadata and a similarity floor; never dump a blind top-k  (path: /src/rag)
<!-- scope: folder | priority: high | strict -->

Top-k always returns k rows, even when nothing relevant exists. Without a floor, irrelevant chunks become "context" and the model dutifully reasons over garbage.

- Apply a similarity threshold and drop matches below it. If nothing clears the floor, return no context and let the model say it does not know, rather than padding the prompt with weak matches.
- Always pre-filter by access and scope metadata (tenant id, user, document permissions, language, recency) in the SQL `WHERE`, not after retrieval. Skipping this is how one tenant's vectors end up in another tenant's answer.
- Cap the context you assemble by token budget, not just row count; trim the lowest-scoring chunks first and keep room for the system prompt and the answer.
- Always include each chunk's source metadata in the assembled context so the model can cite and the user can verify.
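
A rough sketch of the floor-and-budget step described above, applied to rows that have already been pre-filtered in SQL. The `Hit` type, `assemble_context`, the `0.75` floor, and the word-count token estimate are all illustrative assumptions; the right floor depends on the embedding model and should be tuned against labelled queries.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    source: str        # provenance string shown to the model for citation
    similarity: float  # higher is better, e.g. 1 - cosine distance


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): count whitespace-separated words."""
    return len(text.split())


def assemble_context(hits, floor=0.75, token_budget=1500):
    """Drop hits below the floor, then fill the budget best-first."""
    kept = sorted(
        (h for h in hits if h.similarity >= floor),
        key=lambda h: h.similarity,
        reverse=True,
    )
    parts, used = [], 0
    for hit in kept:
        block = f"[source: {hit.source}]\n{hit.text}"
        cost = approx_tokens(block)
        if used + cost > token_budget:
            break  # everything after this point scores lower, so it is trimmed
        parts.append(block)
        used += cost
    # An empty string means "no context": the caller should let the model say so.
    return "\n\n".join(parts)
```

Because the list is sorted by similarity before the budget is applied, the chunks that get trimmed are always the lowest-scoring ones. `token_budget` here covers only the retrieved context; the caller is assumed to have already reserved room for the system prompt and the answer.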

---

### [MEMORY] Vector store: pgvector with an HNSW index  (path: /src/rag)

We keep vectors in Postgres with the `pgvector` extension rather than standing up a separate vector database. Chunks live in a normal table beside their metadata, so a retrieval query is one SQL statement that filters and ranks together.

- Store the embedding as a `vector(N)` column where N is the model's dimension. Keep chunk text, the vector, and provenance metadata in the same row.
- Index with HNSW (`USING hnsw (embedding vector_cosine_ops)`) for fast approximate nearest-neighbour search at scale; it gives a better speed/recall trade-off than IVFFlat for most workloads and needs no training step, at the cost of slower index builds and more memory.
- Match the index operator class to the distance metric your embedding model expects (cosine is the common default). The query operator (`<=>` for cosine) must match the index's operator class, or the planner cannot use the index and falls back to a sequential scan.
- Filter then rank: put tenant/permission predicates in the `WHERE` and order by distance. A partial HNSW index, or a B-tree index on your common filter columns, keeps it fast.
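
The layout above might look like the following, written as SQL strings held in Python. Treat it as a sketch under assumptions: the `chunks` table, its column names, and the index name are hypothetical, and the `%(name)s` placeholders assume a psycopg-style driver.

```python
def schema_sql(dim: int) -> str:
    """DDL for a chunk table whose vector column matches the model dimension."""
    if dim <= 0:
        raise ValueError("dimension must be positive")
    return f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS chunks (
    id             bigserial PRIMARY KEY,
    tenant_id      text    NOT NULL,
    source_id      text    NOT NULL,
    title          text,
    section        text,
    chunk_position integer NOT NULL,
    content        text    NOT NULL,
    embedding      vector({dim}) NOT NULL
);

-- Cosine operator class here must pair with the <=> operator in queries.
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""


# Filter and rank in one statement; <=> is cosine distance, so 1 - distance
# is reported as a similarity score for the floor check downstream.
RETRIEVE_SQL = """
SELECT id, source_id, title, section, content,
       1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
WHERE tenant_id = %(tenant_id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values) -> str:
    """Format floats as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"
```

With a driver that accepts named parameters, the query would be run with something like `{"query": to_vector_literal(query_vec), "tenant_id": tenant, "k": 20}`. The `ORDER BY` uses the bare distance expression so the planner can match it to the HNSW index.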

See /src/rag for the chunking and retrieval-filter rules and the re-ranking memory.

---

### [MEMORY] Pin the embedding model and its dimension  (path: /src/rag)

Every vector in an index was produced by one specific embedding model. Vectors from two different models are not comparable, so the model is not a tunable setting - it is part of the schema.

- Pin the exact embedding model id and dimension in config, alongside the table that stores its vectors. Treat a model change like a breaking migration.
- Changing the model (or its dimension) means re-embedding the entire corpus into a new column/table and cutting over; you cannot mix old and new vectors in one index and get meaningful distances.
- Embed queries with the same model and the same normalization you used for the documents, and apply the instruction prefixes the model prescribes (some models define separate query and document prefixes). Using one model for docs and another for queries silently tanks recall.
- Batch embedding calls during ingestion to amortize latency and cost, and record the model id on each row so you can audit and re-embed selectively.
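
One way to make the pin explicit in code is sketched below. Everything named here is hypothetical: `EmbeddingConfig`, `embed_documents`, the `example-embed-v1` model id, and `embed_fn`, which stands in for whatever provider call actually returns vectors. Prefixes are kept per side because some models prescribe distinct document and query prefixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    model_id: str           # pinned; changing it is a migration, not a tweak
    dimension: int          # must equal N in the vector(N) column
    doc_prefix: str = ""    # instruction prefix for documents, if the model uses one
    query_prefix: str = ""  # instruction prefix for queries, if the model uses one


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_documents(texts, config, embed_fn, batch_size=64):
    """Embed texts in batches and stamp each row with the model id."""
    rows = []
    for batch in batched(texts, batch_size):
        vectors = embed_fn([config.doc_prefix + text for text in batch])
        for text, vector in zip(batch, vectors):
            # Refuse vectors that do not match the pinned dimension.
            if len(vector) != config.dimension:
                raise ValueError(
                    f"{config.model_id}: expected {config.dimension} dims, got {len(vector)}"
                )
            rows.append(
                {"content": text, "embedding": list(vector), "embedding_model": config.model_id}
            )
    return rows
```

The dimension check turns an accidental model swap into a loud ingestion failure instead of a silently corrupted index, and the `embedding_model` field on each row is what allows selective re-embedding later.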

See /src/rag for the pgvector store memory.

---

### [MEMORY] Add a re-ranking pass when precision matters  (path: /src/rag)

Vector similarity is fast but coarse: it ranks by embedding distance, which is not the same as relevance to the actual question. A re-ranking pass fixes the ordering where it counts.

- Retrieve a wider candidate set first (e.g. top 20-50 by vector distance), then re-rank those candidates with a reranker/cross-encoder that scores each chunk against the query directly, and keep the top few.
- Re-ranking trades latency and cost for precision. Add it when answers are subtly off despite the right documents being in the index; skip it for latency-critical or low-stakes lookups.
- Hybrid retrieval - combining vector search with keyword/full-text (BM25-style) search and merging the results - often beats either alone, especially for exact terms, names, and codes that embeddings blur.
- Measure before and after with a small labelled question/answer set. "It feels better" is not a retrieval metric; recall@k and answer correctness are.
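
The three ideas above (over-fetch then re-rank, merge vector and keyword results, measure with recall@k) can be sketched in a few functions. Assumptions are labelled: `score_fn` is a hypothetical stand-in for a cross-encoder or reranker call, and reciprocal rank fusion with `k=60` is one common merging scheme chosen for illustration, not something this package prescribes.

```python
def rerank(query, candidates, score_fn, keep=5):
    """Re-order over-fetched candidates by a direct query/chunk score; keep the best."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:keep]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists; each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def recall_at_k(retrieved, relevant, k):
    """Fraction of the labelled relevant ids that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

Rank fusion only needs positions, not scores, which sidesteps the problem that vector distances and full-text scores are on different scales. Running `recall_at_k` over the same labelled set before and after a change gives the comparison the last bullet asks for.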

See /src/rag for the retrieval-filter rule and the embedding-model memory.

---

### [SKILL] rag-ingestion-pipeline  (path: /)

---
name: rag-ingestion-pipeline
description: Checklist for building or changing a retrieval-augmented generation ingestion and retrieval pipeline. Run when adding a corpus, changing chunking, or debugging irrelevant retrieval.
---

# RAG ingestion & retrieval pipeline

## Ingestion
- [ ] Load source, then chunk on semantic boundaries within a token budget, with 10-20% overlap; no mid-sentence/mid-code splits.
- [ ] Attach provenance metadata to every chunk: source id, title, section, position, tenant/permissions.
- [ ] Embed with the pinned model + dimension; same model and normalization for docs and queries, with the prefixes the model prescribes for each; batch the calls.
- [ ] Upsert chunk text + `vector(N)` + metadata into Postgres; create/refresh the HNSW index with the operator class matching the distance metric.
- [ ] Record the embedding model id on each row so a model change can re-embed selectively.

## Retrieval
- [ ] Pre-filter by tenant/permission/scope metadata in the `WHERE` clause.
- [ ] Rank by vector distance using the operator that matches the index; apply a similarity floor and drop weak matches.
- [ ] If precision matters, over-fetch candidates and re-rank (and consider hybrid keyword + vector).
- [ ] Assemble context within a token budget, trimming lowest-scoring chunks first; include source metadata for citation.
- [ ] If nothing clears the floor, return no context rather than padding the prompt.

## Verify
- [ ] Evaluate on a small labelled Q/A set (recall@k, answer correctness) before and after changes; don't ship on vibes.