A production-grade RAG pipeline engineered for science, technology, engineering, and mathematics corpora. Combines dense vector search, cross-encoder reranking, and frontier LLM generation for grounded, citation-backed technical answers.
A modular, three-layer design that cleanly separates ingestion, retrieval, and generation concerns — enabling independent optimisation at each stage.
From raw document ingestion to grounded LLM generation — each stage is purpose-built for technical STEM content.
ingestion.py reads every supported format, extracts text, tables, and images page-by-page
Text split into overlapping chunks preserving equations, tables, and STEM notation
embeddings.py encodes chunks into dense vectors for semantic similarity search
vectorstore.py persists all vectors into ChromaDB on disk for fast ANN queries
retriever.py encodes query, runs ANN search, returns top-K candidate chunks
Cross-encoder scores each candidate against the query and selects the best top-3
llm.py fills STEM prompt template with ranked context and queries Claude / OpenRouter
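The chunking stage above can be sketched as a simple overlapping window splitter. This is a minimal illustration, not the project's actual implementation — `chunk_size` and `overlap` are hypothetical parameter names, and a production chunker would split on sentence or equation boundaries rather than raw characters:

```python
# Hypothetical sketch of overlapping chunking; parameter names are
# illustrative, not the project's actual configuration keys.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    The overlap keeps content that straddles a chunk boundary (an
    equation, a table row) intact in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```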
Each Python file in the backend carries a single, well-defined responsibility — maximising testability and independent evolution.
Entry point for all document processing. Handles PDF, DOCX, PPTX, XLSX, TXT, CSV, and HTML. Extracts text per page, identifies table regions, and routes figures to the extracted_images/ directory. Produces a uniform list of content objects for the embedding stage.
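The dispatch pattern can be sketched as below. The `ContentItem` fields and loader names are assumptions for illustration; the real module plugs in format-specific parsers (PDF, DOCX, PPTX, XLSX, CSV, HTML) behind the same interface:

```python
# Illustrative sketch of ingestion.py's format dispatch. ContentItem
# fields and the LOADERS mapping are hypothetical names.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ContentItem:
    source: str    # originating file name
    page: int      # page or sheet number (1-based)
    kind: str      # "text", "table", or "image"
    content: str   # extracted text, or a path for routed images

def load_txt(path: Path) -> list[ContentItem]:
    return [ContentItem(source=path.name, page=1, kind="text",
                        content=path.read_text(encoding="utf-8"))]

# The real module maps ".pdf", ".docx", ".pptx", ... to their parsers.
LOADERS = {".txt": load_txt}

def ingest(path: Path) -> list[ContentItem]:
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported format: {path.suffix}")
    return loader(path)
```

Every loader returns the same uniform list of content objects, which is what lets the embedding stage stay format-agnostic.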
Wraps the chosen bi-encoder model and exposes a clean embed(texts) interface. Handles batching, L2 normalisation, and model-specific quirks. Used identically at index time and query time to ensure vector-space consistency.
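A minimal sketch of the `embed(texts)` interface, assuming a SentenceTransformer-style model with an `.encode()` method. The batching and L2 normalisation are the point here; the model is injected so the sketch stays runnable without downloading weights:

```python
# Sketch of embeddings.py; the model object is injected and assumed
# to expose .encode(list[str]) -> array, as sentence-transformers does.
import numpy as np

class Embedder:
    def __init__(self, model, batch_size: int = 32):
        self.model = model
        self.batch_size = batch_size

    def embed(self, texts: list[str]) -> np.ndarray:
        batches = [texts[i:i + self.batch_size]
                   for i in range(0, len(texts), self.batch_size)]
        vecs = np.vstack([np.asarray(self.model.encode(b)) for b in batches])
        # L2-normalise so dot product equals cosine similarity; using the
        # same method at index and query time keeps the vector space consistent
        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
        return vecs / np.clip(norms, 1e-12, None)
```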
Thin abstraction over ChromaDB. Manages collection creation, bulk upsert of (vector, metadata, id) triples, and cosine-similarity queries. Loads the persistent chroma_db/ directory at startup — no re-indexing required between sessions.
Orchestrates fast ANN search over ChromaDB to surface top-K candidates, then applies the cross-encoder reranker to rescore and filter to the highest-confidence top-3 chunks for context injection.
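The two-stage flow can be sketched as follows. The ANN search and cross-encoder are injected as callables so the sketch stays self-contained; in the real module these would be the ChromaDB query and a cross-encoder's batched scoring call:

```python
# Illustrative two-stage retrieval. ann_search and rerank_score are
# stand-ins for the ChromaDB query and the cross-encoder, respectively.

def retrieve(query: str, ann_search, rerank_score,
             top_k: int = 10, top_n: int = 3) -> list[str]:
    """Stage 1: cheap ANN recall; Stage 2: precise cross-encoder rerank."""
    candidates = ann_search(query, top_k)                 # list of chunk strings
    scores = rerank_score([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```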
Manages inference routing. Primary path calls the Claude API; falls back to OpenRouter for alternative model configurations. Accepts the ranked context window, populates the STEM prompt template, and returns the final grounded answer.
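The prompt assembly step can be sketched as below. The template wording and citation format are assumptions for illustration; the real module sends the filled prompt to the Claude API, with OpenRouter as the fallback path:

```python
# Sketch of llm.py's prompt assembly; template text and the chunk
# dict keys are hypothetical.

STEM_TEMPLATE = """Answer the question using ONLY the context below.
Cite sources as [document, page] after each claim.

Context:
{context}

Question: {question}"""

def build_prompt(chunks: list[dict], question: str) -> str:
    # Prefix each chunk with its provenance so the model can emit
    # citations traceable to the original document and page
    context = "\n\n".join(
        f"[{c['source']}, p.{c['page']}] {c['text']}" for c in chunks)
    return STEM_TEMPLATE.format(context=context, question=question)
```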
Top-level application entry point. Wires all backend modules together, handles the request/response lifecycle, and exposes the query interface. Reads config at startup to initialise model parameters, chunking settings, and retrieval depth.
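Startup configuration loading can be sketched as defaults overlaid with a config file. The key names and file format here are illustrative assumptions, not the project's actual config schema:

```python
# Hypothetical config loading for the entry point; key names and the
# JSON format are assumptions.
import json
from pathlib import Path

DEFAULTS = {
    "chunk_size": 500,      # chunking settings
    "chunk_overlap": 100,
    "top_k": 10,            # retrieval depth (ANN candidates)
    "top_n": 3,             # chunks kept after reranking
}

def load_config(path: str = "config.json") -> dict:
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text(encoding="utf-8")))
    return cfg
```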
Why naive vector search is insufficient for STEM — and how the cross-encoder reranker resolves that.
The embedding model independently encodes the query and all document chunks into dense vectors during indexing. At query time, ChromaDB's approximate nearest-neighbour search finds the top-K chunks in milliseconds. This stage prioritises speed over precision — because bi-encoders never model query-passage interaction, relevant chunks can be outranked by irrelevant passages that merely share surface-level vocabulary.
The cross-encoder receives the full (query, passage) pair concatenated and computes a deep relevance score. It cannot scale to full-corpus search, but on the 10–20 candidates from Stage 1 it executes in milliseconds. For STEM content this is essential: passages about "fluid dynamics pressure" and "blood pressure dynamics" are geometrically proximate in embedding space — the cross-encoder correctly resolves the domain distinction.
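A toy illustration of the failure mode, using bag-of-words cosine similarity rather than the production models: independent encodings reward surface word overlap, so the off-domain "blood pressure dynamics" passage scores almost as high as a genuinely relevant one. A cross-encoder, seeing the pair jointly, can weigh the domain mismatch instead:

```python
# Toy bag-of-words cosine similarity, purely to illustrate how
# independent encodings inflate surface-overlap matches. The real
# system uses dense bi-encoder embeddings, not word counts.
from collections import Counter
import math

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

query = "fluid dynamics pressure"
relevant = "pressure gradients in fluid dynamics"
distractor = "blood pressure dynamics"
# The distractor shares two of three query terms, so it scores high
# despite being from the wrong domain.
```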
| Challenge | Example | Resolution |
|---|---|---|
| Symbol ambiguity | "σ" in stress analysis vs. statistics | Surrounding context resolves domain |
| Formula retrieval | Navier-Stokes derivation | Equation-heavy passages scored higher |
| Unit equivalence | N/m² vs. Pa vs. kPa | Joint model handles notation variants |
| Near-duplicate chunks | Same theorem, different notation | Clearest, most complete version selected |
Chosen for local-first, privacy-preserving operation — no data leaves the infrastructure without explicit routing.
The system is engineered around three principles: modularity (each component is independently replaceable), locality (ChromaDB persists on-disk with no external service dependency), and grounding (every LLM response is anchored to retrieved source material, with citations traceable to the original document and page).
Primary inference uses the Anthropic Claude API for maximum reasoning fidelity on complex STEM queries. OpenRouter provides a configurable fallback for alternative model families — both paths consume identical prompt templates and return structured, citation-annotated responses.
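The routing logic can be sketched as a try-then-fallback wrapper. The function names are hypothetical; in the real module the two callables wrap the Anthropic and OpenRouter clients, both consuming the same filled prompt template:

```python
# Hypothetical sketch of primary/fallback inference routing; the
# callables stand in for the Claude and OpenRouter client wrappers.

def generate(prompt: str, primary, fallback) -> str:
    try:
        return primary(prompt)
    except Exception:
        # Both paths receive the identical prompt, so answers stay
        # structurally consistent across model families
        return fallback(prompt)
```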