RAG System: ChromaDB + STEM-Optimised Reranker

Retrieval-Augmented Generation for STEM Knowledge

A production-grade RAG pipeline built specifically for science, technology, engineering, and mathematics documents. Combines ChromaDB vector search, dense embeddings, cross-encoder reranking, and Claude/OpenRouter LLMs to deliver precise, grounded answers from your technical corpus.

ChromaDB Vector Store
Dense Embeddings
Cross-Encoder Reranker
Claude + OpenRouter
Image Extraction
[Pipeline diagram] User query ("Explain Navier-Stokes...") → embedding model (query → dense 768-d vector) → ChromaDB vector store (ANN search · top-K candidates · metadata filter, STEM index) → text chunks (top-10 retrieved) + image chunks (visual context) → cross-encoder reranker (re-scores chunks · selects top-3 · STEM priority) → LLM generation (Claude / OpenRouter + prompt template).

System Architecture

A layered, modular architecture that separates ingestion, retrieval, reranking, and generation cleanly.

[Architecture diagram] Ingestion: documents (PDF · DOCX · TXT · PPTX · XLSX · HTML) → ingestion.py (parse · chunk · OCR · image extraction) → embeddings.py (dense encoder, N×768 vectors) → vectorstore.py (ChromaDB index, persist + load) → chroma_db/ (persistent on-disk vectors) and extracted_images/ (figures · diagrams · equations · charts). Retrieval: STEM query → retriever.py (ANN query · top-K fetch) → cross-encoder reranker (score + sort) → context builder (top-3 chunks + images). Generation: prompt.txt (STEM template) → llm.py (Claude API, OpenRouter fallback) → grounded answer with source citations, hallucination-minimised. app.py is the main orchestrator (API server · query loop); config holds model selection, API keys, and params; utils/ provides helpers, logging, and text processing. backend/ contains: embeddings.py · ingestion.py · llm.py · retriever.py · vectorstore.py · __init__.py

RAG Pipeline in Detail

From raw document to grounded answer — every stage is optimised for STEM content.

01

Document Parse

ingestion.py reads every supported format and extracts text, tables, and images page by page

02

Chunking

Text is split into overlapping chunks, preserving equations, tables, and STEM notation
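The chunker itself lives in ingestion.py and isn't shown here; a minimal sketch of overlapping character-window chunking (the sizes and overlap below are illustrative, not the project's actual parameters):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows.

    The overlap ensures that an equation or definition straddling a
    boundary appears intact in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
    return chunks

chunks = chunk_text("x" * 2000, chunk_size=800, overlap=150)
# Consecutive full-size chunks share `overlap` characters.
```

A production chunker would split on sentence or section boundaries rather than raw character offsets, but the overlap principle is the same.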

03

Embed

embeddings.py encodes chunks into dense vectors ready for semantic similarity search

04

Index

vectorstore.py persists all vectors into ChromaDB on disk for fast ANN queries

05

Retrieve

retriever.py encodes query, runs ANN search, returns Top-K candidate chunks

06

Rerank

Cross-encoder scores each candidate against the query and picks the best Top-3

07

Generate

llm.py fills the prompt template with context and sends it to Claude / OpenRouter

🔍 Vector Search — How Semantic Retrieval Works
[Retrieval diagram] Query text → embedding model → vector q ∈ ℝ⁷⁶⁸ → ChromaDB vector space (top-K nearest neighbours: chunk #47 · 0.94, #23 · 0.91, #08 · 0.88, #61 · 0.82, ...10 total) → cross-encoder re-scores (query, chunk) pairs (★ #23 · 0.97, ★ #47 · 0.95, ★ #08 · 0.89; 7 low-score chunks discarded) → top-3 into LLM context → Claude / OpenRouter → answer ✓

Backend Modules

Every Python file in the backend/ folder has a single clear responsibility.

📥

ingestion.py

Entry point for all document processing. Handles PDF, DOCX, PPTX, XLSX, TXT, CSV, HTML. Extracts text per page, identifies table regions, and routes images to the extracted_images/ directory. Produces a uniform list of content objects for the embedding stage.

🧮

embeddings.py

Wraps the chosen embedding model and exposes a clean embed(texts) interface. Handles batching, normalisation, and any model-specific quirks. Used by both the indexing pipeline and the query-time retrieval path so both use identical vector representations.
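A sketch of what that batching-and-normalisation wrapper might look like; the real model call is mocked with a toy encoder since the embedding model itself is configurable:

```python
import numpy as np

def embed(texts: list[str], encode, batch_size: int = 32) -> np.ndarray:
    """Batch texts through `encode` and L2-normalise the result.

    `encode` stands in for the real model's batch-encode call. With
    unit-norm vectors, cosine similarity reduces to a dot product.
    """
    vecs = []
    for i in range(0, len(texts), batch_size):
        batch = np.asarray(encode(texts[i:i + batch_size]), dtype=np.float32)
        vecs.append(batch)
    out = np.vstack(vecs)
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)

# Toy encoder, just to exercise the wrapper: 3 hand-picked features per text.
toy = lambda batch: [[len(t), t.count("e"), 1.0] for t in batch]
q = embed(["energy", "entropy"], toy)
```

Routing both indexing and query-time encoding through this one function is what guarantees the identical vector representations the text above mentions.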

🗄️

vectorstore.py

Thin abstraction layer over ChromaDB. Manages collection creation, bulk upsert of (vector, metadata, id) triples, and similarity queries. Loads the persistent chroma_db/ directory on startup so no re-indexing is needed between sessions.

🔍

retriever.py

Orchestrates the two-stage retrieval: first a fast approximate-nearest-neighbour search over ChromaDB returns the top-K candidates, then the cross-encoder reranker rescores them and returns only the highest-confidence Top-3 chunks for context injection.
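The narrowing logic can be sketched with pluggable scorers; the toy scoring functions below stand in for the real ANN search and cross-encoder, which are far more capable:

```python
def retrieve(query, chunks, ann_score, ce_score, top_k=10, top_n=3):
    """Two-stage retrieval: a cheap per-chunk score narrows the corpus to
    top_k candidates, then an expensive joint (query, chunk) score
    re-ranks the survivors and keeps top_n for the LLM context."""
    # Stage 1: fast, approximate — score every chunk independently.
    candidates = sorted(chunks, key=lambda c: ann_score(query, c), reverse=True)[:top_k]
    # Stage 2: slow, exact — score only the survivors jointly with the query.
    return sorted(candidates, key=lambda c: ce_score(query, c), reverse=True)[:top_n]

# Toy scorers: shared-word count (stage 1) vs. exact-phrase bonus (stage 2).
s1 = lambda q, c: len(set(q.split()) & set(c.split()))
s2 = lambda q, c: s1(q, c) + (5 if q in c else 0)
docs = ["pressure in fluid dynamics", "blood pressure dynamics",
        "navier stokes pressure term", "unrelated chemistry notes"]
best = retrieve("fluid dynamics", docs, s1, s2, top_k=3, top_n=2)
```

The shape of the computation is the point: stage 2 costs per-pair inference, so it only ever sees the handful of chunks stage 1 lets through.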

🤖

llm.py

Manages LLM inference. Primary path uses the Claude API (key from claude api key.txt). Falls back to OpenRouter for alternative models (configured in openrouter_related.txt). Takes the ranked context, fills prompt.txt, and streams or returns the final answer.
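The context-to-prompt step might look like the sketch below; the real template lives in prompt.txt, and the placeholder names ({context}, {question}) are illustrative, not the project's actual ones:

```python
# Illustrative stand-in for prompt.txt's STEM template.
TEMPLATE = (
    "Answer strictly from the context below. "
    "Cite the source of each claim; say 'not in context' if unsure.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(chunks: list[dict], question: str) -> str:
    """Join the top-ranked chunks, each tagged with its source, into one prompt."""
    context = "\n---\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    [{"source": "fluids.pdf p.12", "text": "∇·u = 0 for incompressible flow."}],
    "State the incompressibility condition.",
)
```

Tagging each chunk with its source in the prompt is what lets the model emit the citations the answer stage promises.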

🚀

app.py

Top-level application orchestrator. Wires together all backend modules, handles the request/response lifecycle, and exposes the query interface. Reads config at startup to determine model settings, chunk parameters, and retrieval depth.

Two-Stage Retrieval & Reranking

Why reranking matters — and why it's critical for STEM content specifically.

Stage 1 — Bi-Encoder (ANN)

The embedding model (bi-encoder) encodes all document chunks into dense vectors once during indexing, and encodes the query independently at request time. ChromaDB's approximate nearest-neighbour search then finds the top-K chunks in milliseconds. This is fast but approximate — some genuinely relevant chunks may score lower than less-relevant ones because bi-encoders never model the interaction between query and passage directly.
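Stage 1's target behaviour is exact cosine top-K; a brute-force version shows what the ANN index (HNSW in ChromaDB's case) approximates in sub-linear time:

```python
import numpy as np

def top_k_cosine(q: np.ndarray, index: np.ndarray, k: int) -> list[int]:
    """Exact top-k row indices by cosine similarity — the reference
    result that an ANN index approximates without scanning every row."""
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-12)
    return list(np.argsort(-sims)[:k])

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 8)).astype(np.float32)   # 100 toy chunk vectors
q = index[42] + 0.01 * rng.normal(size=8).astype(np.float32)  # query near chunk 42
hits = top_k_cosine(q, index, k=5)
```

At corpus scale this linear scan is what ANN avoids; the trade is recall for latency, which is exactly why the reranker in Stage 2 gets the final word.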

🎯

Stage 2 — Cross-Encoder (Reranker)

The cross-encoder sees the full (query, passage) pair together and computes a deeper relevance score. It's too slow to run over the entire corpus, but on just 10–20 candidates from Stage 1 it's extremely fast. In STEM contexts, this is critical: a passage about "fluid dynamics pressure" may look geometrically similar to "blood pressure dynamics" — the cross-encoder correctly distinguishes them.

📊

Why STEM Needs This

Semantic embedding similarity alone struggles with:

Challenge | Example | Reranker Fix
Symbol ambiguity | "σ" in stress vs. statistics | Context from surrounding text resolves the domain
Formula lookups | "Navier-Stokes derivation" | Cross-encoder scores equation-heavy passages higher
Units & notation | N/m² vs. Pa vs. kPa | Joint query-passage model handles unit equivalence
Near-duplicate chunks | Same theorem, different notation | Reranker picks the clearest, most complete version

🔑 Retrieval Parameters

  • Top-K (ANN): Configurable via config — typically 10–20 candidates
  • Top-N (after rerank): 3 chunks sent to LLM context window
  • Similarity metric: Cosine distance in ChromaDB
  • Metadata filters: Can scope by document, chapter, or content type
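Expressed as config, the parameters above might look like this; the key names are illustrative, and the project's actual config may differ:

```python
# Illustrative retrieval config — key names are assumptions, not the
# project's actual config schema.
RETRIEVAL = {
    "top_k": 10,           # ANN candidates fetched from ChromaDB (10–20 typical)
    "top_n": 3,            # chunks kept after cross-encoder rerank
    "distance": "cosine",  # ChromaDB collection metric
    "filters": {"content_type": "text"},  # optional metadata scoping
}
```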

Technology Stack

Carefully chosen for local-first, privacy-preserving, STEM-optimised operation.

Core Stack

ChromaDB
sentence-transformers
Cross-Encoder
Claude API
OpenRouter
Python 3.10+

Document & Utility

pdfplumber / PyMuPDF
python-docx
openpyxl
python-pptx
Pillow
numpy

📁 Project Structure