A production-grade RAG pipeline engineered for science, technology, engineering, and mathematics corpora. Combines dense vector search, cross-encoder reranking, and frontier LLM generation for grounded, citation-backed technical answers.
A modular, three-layer design that cleanly separates ingestion, retrieval, and generation concerns — enabling independent optimisation at each stage.
From raw document ingestion to grounded LLM generation — each stage is purpose-built for technical STEM content.
ingestion.py reads every supported format, extracts text, tables, and images page-by-page
Text split into overlapping chunks preserving equations, tables, and STEM notation
embeddings.py encodes chunks into dense vectors for semantic similarity search
vectorstore.py persists all vectors into ChromaDB on disk for fast ANN queries
retriever.py encodes query, runs ANN search, returns top-K candidate chunks
Cross-encoder scores each candidate against the query and selects the best top-3
llm.py fills STEM prompt template with ranked context and queries Claude / OpenRouter
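The chunking stage above can be sketched as a simple overlapping window splitter. This is a minimal illustration, not the project's actual implementation — `chunk_size` and `overlap` are hypothetical parameter names, and a production chunker would split on sentence or equation boundaries rather than raw characters:

```python
# Hypothetical sketch of overlapping chunking; parameter names are
# illustrative, not the project's actual configuration keys.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    The overlap keeps content that straddles a chunk boundary (an
    equation, a table row) intact in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```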
Each Python file in the backend carries a single, well-defined responsibility — maximising testability and independent evolution.
Entry point for all document processing. Handles PDF, DOCX, PPTX, XLSX, TXT, CSV, and HTML. Extracts text per page, identifies table regions, and routes figures to the extracted_images/ directory. Produces a uniform list of content objects for the embedding stage.
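The dispatch pattern can be sketched as below. The `ContentItem` fields and loader names are assumptions for illustration; the real module plugs in format-specific parsers (PDF, DOCX, PPTX, XLSX, CSV, HTML) behind the same interface:

```python
# Illustrative sketch of ingestion.py's format dispatch. ContentItem
# fields and the LOADERS mapping are hypothetical names.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ContentItem:
    source: str    # originating file name
    page: int      # page or sheet number (1-based)
    kind: str      # "text", "table", or "image"
    content: str   # extracted text, or a path for routed images

def load_txt(path: Path) -> list[ContentItem]:
    return [ContentItem(source=path.name, page=1, kind="text",
                        content=path.read_text(encoding="utf-8"))]

# The real module maps ".pdf", ".docx", ".pptx", ... to their parsers.
LOADERS = {".txt": load_txt}

def ingest(path: Path) -> list[ContentItem]:
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return loader(path)
```

Every loader returns the same uniform list of content objects, which is what lets the embedding stage stay format-agnostic.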
Wraps the chosen bi-encoder model and exposes a clean embed(texts) interface. Handles batching, L2 normalisation, and model-specific quirks. Used identically at index time and query time to ensure vector-space consistency.
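A minimal sketch of the `embed(texts)` interface, assuming a SentenceTransformer-style model with an `.encode()` method. The batching and L2 normalisation are the point here; the model is injected so the sketch stays runnable without downloading weights:

```python
# Sketch of embeddings.py; the model object is injected and assumed
# to expose .encode(list[str]) -> array, as sentence-transformers does.
import numpy as np

class Embedder:
    def __init__(self, model, batch_size: int = 32):
        self.model = model
        self.batch_size = batch_size

    def embed(self, texts: list[str]) -> np.ndarray:
        batches = [texts[i:i + self.batch_size]
                   for i in range(0, len(texts), self.batch_size)]
        vecs = np.vstack([np.asarray(self.model.encode(b)) for b in batches])
        # L2-normalise so dot product equals cosine similarity; using the
        # same method at index and query time keeps the vector space consistent
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs / np.clip(norms, 1e-12, None)
```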
Thin abstraction over ChromaDB. Manages collection creation, bulk upsert of (vector, metadata, id) triples, and cosine-similarity queries. Loads the persistent chroma_db/ directory at startup — no re-indexing required between sessions.
Orchestrates fast ANN search over ChromaDB to surface top-K candidates, then applies the cross-encoder reranker to rescore and filter to the highest-confidence top-3 chunks for context injection.
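The two-stage flow can be sketched as follows. The ANN search and cross-encoder are injected as callables so the sketch stays self-contained; in the real module these would be the ChromaDB query and a cross-encoder's batched scoring call:

```python
# Illustrative two-stage retrieval. ann_search and rerank_score are
# stand-ins for the ChromaDB query and the cross-encoder, respectively.

def retrieve(query: str, ann_search, rerank_score,
             top_k: int = 10, top_n: int = 3) -> list[str]:
    """Stage 1: cheap ANN recall; Stage 2: precise cross-encoder rerank."""
    candidates = ann_search(query, top_k)                 # list of chunk strings
    scores = rerank_score([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```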
Manages inference routing. Primary path calls the Claude API; falls back to OpenRouter for alternative model configurations. Accepts the ranked context window, populates the STEM prompt template, and returns the final grounded answer.
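The prompt assembly step can be sketched as below. The template wording and citation format are assumptions for illustration; the real module sends the filled prompt to the Claude API, with OpenRouter as the fallback path:

```python
# Sketch of llm.py's prompt assembly; template text and the chunk
# dict keys are hypothetical.

STEM_TEMPLATE = """Answer the question using ONLY the context below.
Cite sources as [document, page] after each claim.

Context:
{context}

Question: {question}"""

def build_prompt(chunks: list[dict], question: str) -> str:
    # Prefix each chunk with its provenance so the model can emit
    # citations traceable to the original document and page
    context = "\n\n".join(
        f"[{c['source']}, p.{c['page']}] {c['text']}" for c in chunks)
    return STEM_TEMPLATE.format(context=context, question=question)
```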
Top-level application entry point. Wires all backend modules together, handles the request/response lifecycle, and exposes the query interface. Reads config at startup to initialise model parameters, chunking settings, and retrieval depth.
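Startup configuration loading can be sketched as defaults overlaid with a config file. The key names and file format here are illustrative assumptions, not the project's actual config schema:

```python
# Hypothetical config loading for the entry point; key names and the
# JSON format are assumptions.
import json
from pathlib import Path

DEFAULTS = {
    "chunk_size": 500,      # chunking settings
    "chunk_overlap": 100,
    "top_k": 10,            # retrieval depth (ANN candidates)
    "top_n": 3,             # chunks kept after reranking
}

def load_config(path: str = "config.json") -> dict:
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text(encoding="utf-8")))
    return cfg
```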
Why naive vector search is insufficient for STEM — and how the cross-encoder reranker resolves that.
The embedding model independently encodes the query and all document chunks into dense vectors during indexing. At query time, ChromaDB's approximate nearest-neighbour search finds the top-K chunks in milliseconds. This stage prioritises speed over precision — because bi-encoders never model query-passage interaction, relevant chunks can be outranked by irrelevant passages that merely share surface-level vocabulary.
The cross-encoder receives the full (query, passage) pair concatenated and computes a deep relevance score. It cannot scale to full-corpus search, but on the 10–20 candidates from Stage 1 it executes in milliseconds. For STEM content this is essential: passages about "fluid dynamics pressure" and "blood pressure dynamics" are geometrically proximate in embedding space — the cross-encoder correctly resolves the domain distinction.
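A toy illustration of the failure mode, using bag-of-words cosine similarity rather than the production models: independent encodings reward surface word overlap, so the off-domain "blood pressure dynamics" passage scores almost as high as a genuinely relevant one. A cross-encoder, seeing the pair jointly, can weigh the domain mismatch instead:

```python
# Toy bag-of-words cosine similarity, purely to illustrate how
# independent encodings inflate surface-overlap matches. The real
# system uses dense bi-encoder embeddings, not word counts.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

query = "fluid dynamics pressure"
relevant = "pressure gradients in fluid dynamics"
distractor = "blood pressure dynamics"
# The distractor shares two of three query terms, so it scores high
# despite being from the wrong domain.
```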
| Challenge | Example | Resolution |
|---|---|---|
| Symbol ambiguity | "σ" in stress analysis vs. statistics | Surrounding context resolves domain |
| Formula retrieval | Navier-Stokes derivation | Equation-heavy passages scored higher |
| Unit equivalence | N/m² vs. Pa vs. kPa | Joint model handles notation variants |
| Near-duplicate chunks | Same theorem, different notation | Clearest, most complete version selected |
Chosen for local-first, privacy-preserving operation — no data leaves the infrastructure without explicit routing.
The system is engineered around three principles: modularity (each component is independently replaceable), locality (ChromaDB persists on-disk with no external service dependency), and grounding (every LLM response is anchored to retrieved source material, with citations traceable to the original document and page).
Primary inference uses the Anthropic Claude API for maximum reasoning fidelity on complex STEM queries. OpenRouter provides a configurable fallback for alternative model families — both paths consume identical prompt templates and return structured, citation-annotated responses.
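The routing logic can be sketched as a try-then-fallback wrapper. The function names are hypothetical; in the real module the two callables wrap the Anthropic and OpenRouter clients, both consuming the same filled prompt template:

```python
# Hypothetical sketch of primary/fallback inference routing; the
# callables stand in for the Claude and OpenRouter client wrappers.

def generate(prompt: str, primary, fallback) -> str:
    try:
        return primary(prompt)
    except Exception:
        # Both paths receive the identical prompt, so answers stay
        # structurally consistent across model families
        return fallback(prompt)
```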