Retrieval-Augmented Generation

Precision retrieval for STEM knowledge

A production-grade RAG pipeline engineered for science, technology, engineering, and mathematics corpora. Combines dense vector search, cross-encoder reranking, and frontier LLM generation for grounded, citation-backed technical answers.

ChromaDB · Dense Embeddings · Cross-Encoder Reranker · Claude API · OpenRouter · Image Extraction
[Pipeline diagram] User query ("Navier-Stokes equations…") → embedding model (query → dense vector, ℝ⁷⁶⁸) → ChromaDB vector store (ANN search, top-K candidates, cosine, STEM index) → top-10 text chunks + image chunks (visual context) → cross-encoder reranker (re-scores, selects top-3, filters low-confidence) → LLM generation (Claude / OpenRouter + STEM prompt)
  • 7+ document formats
  • 768-dimensional embeddings
  • 2-stage retrieval pipeline
  • ≤3 context chunks sent to the LLM
System Design

Layered Architecture

A modular, three-layer design that cleanly separates ingestion, retrieval, and generation concerns — enabling independent optimisation at each stage.

[Architecture diagram]
  • Ingestion: documents (PDF · DOCX · TXT · PPTX · XLSX · HTML) → ingestion.py (parse, chunk, OCR, image extraction) → embeddings.py (dense encoder, vectors [N×768]) → vectorstore.py (ChromaDB index, persist + load) → chroma_db/ (persistent on-disk vectors) and extracted_images/ (figures, diagrams, equations, charts)
  • Retrieval: user STEM query → retriever.py (ANN query, top-K fetch) → reranker (cross-encoder score + sort) → context builder (top-3 chunks + images)
  • Generation: prompt.txt (STEM template) → llm.py (Claude API, OpenRouter fallback), with config for model selection, API keys, and params → app.py (main orchestrator, API server, query loop) → grounded answer with source citations, hallucination-minimised
  • backend/ contains: embeddings.py · ingestion.py · llm.py · retriever.py · vectorstore.py · __init__.py
Processing Stages

The RAG Pipeline

From raw document ingestion to grounded LLM generation — each stage is purpose-built for technical STEM content.

01
📄

Parse

ingestion.py reads every supported format, extracts text, tables, and images page-by-page

02
✂️

Chunking

Text split into overlapping chunks preserving equations, tables, and STEM notation

03

Embed

embeddings.py encodes chunks into dense vectors for semantic similarity search

04
🗄️

Index

vectorstore.py persists all vectors into ChromaDB on disk for fast ANN queries

05
🔍

Retrieve

retriever.py encodes query, runs ANN search, returns Top-K candidate chunks

06
⚖️

Rerank

Cross-encoder scores each candidate against the query and selects the best top-3

07
🤖

Generate

llm.py fills STEM prompt template with ranked context and queries Claude / OpenRouter
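The chunking stage (02) can be sketched as an overlapping splitter, so notation that straddles a boundary survives intact in at least one chunk. This is a minimal illustration; the `chunk_size` and `overlap` values are placeholders, not the project's actual settings:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so equations, tables, and STEM
    notation that cross a boundary appear whole in at least one chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap  # stride between consecutive chunk starts
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk shares its last `overlap` characters with the start of the next chunk, which is what preserves boundary-crossing content.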

Vector Search — Semantic Retrieval Mechanics
[Vector search diagram] Query text → embedding model → vector q ∈ ℝ⁷⁶⁸ → cosine-similarity search over the ChromaDB vector space → top-K candidates (chunk #47 · 0.94, #23 · 0.91, #08 · 0.88, #61 · 0.82, … 10 retrieved) → cross-encoder re-scores (query, chunk) pairs (★ #23 · 0.97, ★ #47 · 0.95, ★ #08 · 0.89; 7 low-confidence discarded) → 3 chunks to the LLM (Claude / OpenRouter) → answer ✓
Backend Codebase

Backend Modules

Each Python file in the backend carries a single, well-defined responsibility — maximising testability and independent evolution.

ingestion.py

Document Ingestion

Entry point for all document processing. Handles PDF, DOCX, PPTX, XLSX, TXT, CSV, and HTML. Extracts text per page, identifies table regions, and routes figures to the extracted_images/ directory. Produces a uniform list of content objects for the embedding stage.
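The routing described above can be sketched as a dispatch table keyed on file extension. The parser below is a hypothetical stub; in the real module each entry would delegate to pdfplumber/PyMuPDF, python-docx, python-pptx, openpyxl, and so on:

```python
from pathlib import Path
from typing import Callable

def parse_pdf(path: Path) -> list[dict]:
    # Stub: the real parser extracts text, tables, and images per page.
    return [{"source": path.name, "type": "text", "page": 1, "content": "..."}]

# Extension → parser dispatch table (only PDF stubbed here).
PARSERS: dict[str, Callable[[Path], list[dict]]] = {
    ".pdf": parse_pdf,
}

def ingest(path: Path) -> list[dict]:
    """Route a document to its format-specific parser, returning a uniform
    list of content objects for the embedding stage."""
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return parser(path)
```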

embeddings.py

Dense Embedding

Wraps the chosen bi-encoder model and exposes a clean embed(texts) interface. Handles batching, L2 normalisation, and model-specific quirks. Used identically at index time and query time to ensure vector-space consistency.
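The batching and L2-normalisation logic can be sketched as below; the `encode` callable is a stand-in for the real bi-encoder's batch-encode call, not an actual model API:

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def embed(texts: list[str], encode, batch_size: int = 32) -> list[list[float]]:
    """Encode texts in fixed-size batches, then L2-normalise every vector.
    Using the same function at index time and query time keeps the
    vector space consistent."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(encode(texts[i:i + batch_size]))
    return [l2_normalize(v) for v in vectors]
```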

vectorstore.py

Vector Store

Thin abstraction over ChromaDB. Manages collection creation, bulk upsert of (vector, metadata, id) triples, and cosine-similarity queries. Loads the persistent chroma_db/ directory at startup — no re-indexing required between sessions.

retriever.py

Two-Stage Retrieval

Orchestrates fast ANN search over ChromaDB to surface top-K candidates, then applies the cross-encoder reranker to rescore and filter to the highest-confidence top-3 chunks for context injection.
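Stage 1 can be illustrated with a brute-force cosine top-K over an in-memory index of L2-normalised vectors; this is a stand-in for ChromaDB's ANN query, chosen here only to keep the sketch self-contained:

```python
def ann_search(query_vec: list[float],
               index: dict[str, list[float]],
               top_k: int = 10) -> list[tuple[str, float]]:
    """Score every indexed chunk against the query by dot product
    (equal to cosine similarity for unit vectors) and return the
    top_k (chunk_id, score) pairs, best first."""
    scores = [(cid, sum(q * d for q, d in zip(query_vec, vec)))
              for cid, vec in index.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]
```

A real ANN index avoids the full scan, but the ranking semantics are the same.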

llm.py

LLM Interface

Manages inference routing. Primary path calls the Claude API; falls back to OpenRouter for alternative model configurations. Accepts the ranked context window, populates the STEM prompt template, and returns the final grounded answer.
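The primary/fallback routing can be sketched as a simple try-first wrapper; the two callables stand in for the real Claude and OpenRouter clients, which are not shown here:

```python
def generate(prompt: str, call_claude, call_openrouter) -> str:
    """Try the primary Claude path first; on any provider error, retry
    the same prompt via the OpenRouter fallback. Both paths consume the
    identical prompt, mirroring the shared template design."""
    try:
        return call_claude(prompt)
    except Exception:
        return call_openrouter(prompt)
```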

app.py

Orchestrator

Top-level application entry point. Wires all backend modules together, handles the request/response lifecycle, and exposes the query interface. Reads config at startup to initialise model parameters, chunking settings, and retrieval depth.

Retrieval Strategy

Two-Stage Retrieval & Reranking

Why naive vector search is insufficient for STEM — and how the cross-encoder reranker resolves that.

Stage 1 — Bi-Encoder (ANN)

The embedding model independently encodes the query and all document chunks into dense vectors during indexing. At query time, ChromaDB's approximate nearest-neighbour search finds the top-K chunks in milliseconds. This stage prioritises speed over precision: bi-encoders do not model query-passage interaction, so relevant chunks can be outranked by irrelevant passages that share only surface-level vocabulary with the query.

Stage 2 — Cross-Encoder (Reranker)

The cross-encoder receives the query and passage concatenated as a single input and computes a deep, token-level relevance score. It cannot scale to full-corpus search, but over the 10–20 candidates from Stage 1 it runs in milliseconds. For STEM content this is essential: passages about "fluid dynamics pressure" and "blood pressure dynamics" sit close together in embedding space, and the cross-encoder correctly resolves the domain distinction.
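The rescore-filter-select step can be sketched as below. The `score` callable stands in for the real cross-encoder, and the `min_score` confidence threshold is illustrative, not the project's actual value:

```python
def rerank(query: str, candidates: list[str], score,
           top_n: int = 3, min_score: float = 0.5) -> list[str]:
    """Stage 2: rescore every (query, passage) pair, drop low-confidence
    passages, and keep the best top_n for the LLM context window."""
    scored = [(passage, score(query, passage)) for passage in candidates]
    scored = [(p, s) for p, s in scored if s >= min_score]  # confidence filter
    scored.sort(key=lambda ps: ps[1], reverse=True)
    return [p for p, _ in scored[:top_n]]
```

Because filtering happens before selection, fewer than top_n chunks may reach the LLM when confidence is low, which is the intended behaviour.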

Why STEM Demands Reranking

Challenge · Example · Resolution
  • Symbol ambiguity · "σ" in stress analysis vs. statistics · surrounding context resolves the domain
  • Formula retrieval · Navier-Stokes derivation · equation-heavy passages scored higher
  • Unit equivalence · N/m² vs. Pa vs. kPa · joint model handles notation variants
  • Near-duplicate chunks · same theorem, different notation · clearest, most complete version selected

Retrieval Parameters

  • Top-K (ANN stage): 10–20 candidates, configurable via config
  • Top-N (after rerank): 3 chunks injected into the LLM context window
  • Similarity metric: cosine distance in the ChromaDB collection
  • Metadata filters: scope by document, chapter, or content type
  • Reranker model: sentence-transformers cross-encoder
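The parameters above might be gathered into a config of roughly this shape; the key names and the model identifier are assumptions for illustration, not the project's actual schema:

```python
# Illustrative retrieval config; keys and values are examples only.
RETRIEVAL_CONFIG = {
    "top_k": 10,      # ANN candidates fetched from ChromaDB (10-20)
    "top_n": 3,       # chunks injected into the LLM context window
    "metric": "cosine",
    # Example sentence-transformers cross-encoder checkpoint:
    "reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    # Optional scoping filters applied to chunk metadata:
    "metadata_filters": {"document": None, "chapter": None, "content_type": None},
}
```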

Supported Document Types

  • PDF: full text extraction with pdfplumber / PyMuPDF
  • DOCX / PPTX: python-docx and python-pptx parsers
  • XLSX: openpyxl for tabular scientific data
  • TXT · CSV · HTML: direct ingestion with metadata tagging
Technology

Technology Stack

Chosen for local-first, privacy-preserving operation — no data leaves the infrastructure without explicit routing.

Core Infrastructure
ChromaDB · sentence-transformers · Cross-Encoder · Claude API · OpenRouter · Python 3.10+
Document Processing
pdfplumber · PyMuPDF · python-docx · python-pptx · openpyxl · Pillow · numpy

Design Principles

The system is engineered around three principles: modularity (each component is independently replaceable), locality (ChromaDB persists on-disk with no external service dependency), and grounding (every LLM response is anchored to retrieved source material, with citations traceable to the original document and page).

LLM Routing

Primary inference uses the Anthropic Claude API for maximum reasoning fidelity on complex STEM queries. OpenRouter provides a configurable fallback for alternative model families — both paths consume identical prompt templates and return structured, citation-annotated responses.