RAG System: ChromaDB + STEM-Optimised Reranker

Retrieval-Augmented Generation for STEM Knowledge

A production-grade RAG pipeline built specifically for science, technology, engineering, and mathematics documents. Combines ChromaDB vector search, dense embeddings, cross-encoder reranking, and Claude/OpenRouter LLMs to deliver precise, grounded answers from your technical corpus.

ChromaDB Vector Store
Dense Embeddings
Cross-Encoder Reranker
Claude + OpenRouter
Image Extraction
[Pipeline diagram] User query ("Explain Navier-Stokes...") → embedding model (query → dense 768-d vector) → ChromaDB vector store (ANN search · top-K candidates · metadata filter, STEM index) → text chunks (top-10 retrieved) + image chunks (visual context) → cross-encoder reranker (re-scores chunks · selects top-3 · STEM priority) → LLM generation (Claude / OpenRouter + prompt template).

System Architecture

A layered, modular architecture that separates ingestion, retrieval, reranking, and generation cleanly.

[Architecture diagram] Ingestion: documents (PDF · DOCX · TXT · PPTX · XLSX · HTML) → ingestion.py (parse · chunk · OCR · image extraction) → embeddings.py (dense encoder, N×768 vectors) → vectorstore.py (ChromaDB index, persist + load) → chroma_db/ (persistent on-disk vectors) and extracted_images/ (figures · diagrams · equations · charts). Retrieval: STEM query → retriever.py (ANN query · top-K fetch) → cross-encoder reranker (score + sort) → context builder (top-3 chunks + images). Generation: prompt.txt (STEM template) → llm.py (Claude API, OpenRouter fallback) → grounded answer with source citations, hallucination-minimised. app.py is the main orchestrator (API server · query loop); config holds model selection, API keys, and params; utils/ provides helpers, logging, and text processing. backend/ contains: embeddings.py · ingestion.py · llm.py · retriever.py · vectorstore.py · __init__.py

RAG Pipeline in Detail

From raw document to grounded answer — every stage is optimised for STEM content.

01

Document Parse

ingestion.py reads every supported format and extracts text, tables, and images page by page

02

Chunking

Text is split into overlapping chunks, preserving equations, tables, and STEM notation
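The chunker itself lives in ingestion.py and isn't shown here; a minimal sketch of overlapping character-window chunking (the sizes and overlap below are illustrative, not the project's actual parameters):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows.

    The overlap ensures that an equation or definition straddling a
    boundary appears intact in at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text), 1), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append(piece)
    return chunks

chunks = chunk_text("x" * 2000, chunk_size=800, overlap=150)
# Consecutive full-size chunks share `overlap` characters.
```

A production chunker would split on sentence or section boundaries rather than raw character offsets, but the overlap principle is the same.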

03

Embed

embeddings.py encodes chunks into dense vectors ready for semantic similarity search

04

Index

vectorstore.py persists all vectors into ChromaDB on disk for fast ANN queries

05

Retrieve

retriever.py encodes query, runs ANN search, returns Top-K candidate chunks

06

Rerank

Cross-encoder scores each candidate against the query and picks the best Top-3

07

Generate

llm.py fills the prompt template with context and sends it to Claude / OpenRouter

🔍 Vector Search — How Semantic Retrieval Works
[Retrieval diagram] Query text → embedding model → vector q ∈ ℝ⁷⁶⁸ → ChromaDB vector space (top-K nearest neighbours: chunk #47 · 0.94, #23 · 0.91, #08 · 0.88, #61 · 0.82, ...10 total) → cross-encoder re-scores (query, chunk) pairs (★ #23 · 0.97, ★ #47 · 0.95, ★ #08 · 0.89; 7 low-score chunks discarded) → top-3 into LLM context → Claude / OpenRouter → answer ✓

Backend Modules

Every Python file in the backend/ folder has a single clear responsibility.

📥

ingestion.py

Entry point for all document processing. Handles PDF, DOCX, PPTX, XLSX, TXT, CSV, HTML. Extracts text per page, identifies table regions, and routes images to the extracted_images/ directory. Produces a uniform list of content objects for the embedding stage.

🧮

embeddings.py

Wraps the chosen embedding model and exposes a clean embed(texts) interface. Handles batching, normalisation, and any model-specific quirks. Used by both the indexing pipeline and the query-time retrieval path so both use identical vector representations.
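A sketch of what that batching-and-normalisation wrapper might look like; the real model call is mocked with a toy encoder since the embedding model itself is configurable:

```python
import numpy as np

def embed(texts: list[str], encode, batch_size: int = 32) -> np.ndarray:
    """Batch texts through `encode` and L2-normalise the result.

    `encode` stands in for the real model's batch-encode call. With
    unit-norm vectors, cosine similarity reduces to a dot product.
    """
    vecs = []
    for i in range(0, len(texts), batch_size):
        batch = np.asarray(encode(texts[i:i + batch_size]), dtype=np.float32)
        vecs.append(batch)
    out = np.vstack(vecs)
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.clip(norms, 1e-12, None)

# Toy encoder, just to exercise the wrapper: 3 hand-picked features per text.
toy = lambda batch: [[len(t), t.count("e"), 1.0] for t in batch]
q = embed(["energy", "entropy"], toy)
```

Routing both indexing and query-time encoding through this one function is what guarantees the identical vector representations the text above mentions.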

🗄️

vectorstore.py

Thin abstraction layer over ChromaDB. Manages collection creation, bulk upsert of (vector, metadata, id) triples, and similarity queries. Loads the persistent chroma_db/ directory on startup so no re-indexing is needed between sessions.

🔍

retriever.py

Orchestrates the two-stage retrieval: first a fast approximate-nearest-neighbour search over ChromaDB returns the top-K candidates, then the cross-encoder reranker rescores them and returns only the highest-confidence Top-3 chunks for context injection.
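The narrowing logic can be sketched with pluggable scorers; the toy scoring functions below stand in for the real ANN search and cross-encoder, which are far more capable:

```python
def retrieve(query, chunks, ann_score, ce_score, top_k=10, top_n=3):
    """Two-stage retrieval: a cheap per-chunk score narrows the corpus to
    top_k candidates, then an expensive joint (query, chunk) score
    re-ranks the survivors and keeps top_n for the LLM context."""
    # Stage 1: fast, approximate — score every chunk independently.
    candidates = sorted(chunks, key=lambda c: ann_score(query, c), reverse=True)[:top_k]
    # Stage 2: slow, exact — score only the survivors jointly with the query.
    return sorted(candidates, key=lambda c: ce_score(query, c), reverse=True)[:top_n]

# Toy scorers: shared-word count (stage 1) vs. exact-phrase bonus (stage 2).
s1 = lambda q, c: len(set(q.split()) & set(c.split()))
s2 = lambda q, c: s1(q, c) + (5 if q in c else 0)
docs = ["pressure in fluid dynamics", "blood pressure dynamics",
        "navier stokes pressure term", "unrelated chemistry notes"]
best = retrieve("fluid dynamics", docs, s1, s2, top_k=3, top_n=2)
```

The shape of the computation is the point: stage 2 costs per-pair inference, so it only ever sees the handful of chunks stage 1 lets through.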

🤖

llm.py

Manages LLM inference. Primary path uses the Claude API (key from claude api key.txt). Falls back to OpenRouter for alternative models (configured in openrouter_related.txt). Takes the ranked context, fills prompt.txt, and streams or returns the final answer.
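The context-to-prompt step might look like the sketch below; the real template lives in prompt.txt, and the placeholder names ({context}, {question}) are illustrative, not the project's actual ones:

```python
# Illustrative stand-in for prompt.txt's STEM template.
TEMPLATE = (
    "Answer strictly from the context below. "
    "Cite the source of each claim; say 'not in context' if unsure.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def build_prompt(chunks: list[dict], question: str) -> str:
    """Join the top-ranked chunks, each tagged with its source, into one prompt."""
    context = "\n---\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    [{"source": "fluids.pdf p.12", "text": "∇·u = 0 for incompressible flow."}],
    "State the incompressibility condition.",
)
```

Tagging each chunk with its source in the prompt is what lets the model emit the citations the answer stage promises.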

🚀

app.py

Top-level application orchestrator. Wires together all backend modules, handles the request/response lifecycle, and exposes the query interface. Reads config at startup to determine model settings, chunk parameters, and retrieval depth.

Two-Stage Retrieval & Reranking

Why reranking matters — and why it's critical for STEM content specifically.

Stage 1 — Bi-Encoder (ANN)

The embedding model (bi-encoder) encodes all document chunks into dense vectors once during indexing, and encodes the query independently at request time. ChromaDB's approximate nearest-neighbour search then finds the top-K chunks in milliseconds. This is fast but approximate — some genuinely relevant chunks may score lower than less-relevant ones because bi-encoders never model the interaction between query and passage directly.
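Stage 1's target behaviour is exact cosine top-K; a brute-force version shows what the ANN index (HNSW in ChromaDB's case) approximates in sub-linear time:

```python
import numpy as np

def top_k_cosine(q: np.ndarray, index: np.ndarray, k: int) -> list[int]:
    """Exact top-k row indices by cosine similarity — the reference
    result that an ANN index approximates without scanning every row."""
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-12)
    return list(np.argsort(-sims)[:k])

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 8)).astype(np.float32)   # 100 toy chunk vectors
q = index[42] + 0.01 * rng.normal(size=8).astype(np.float32)  # query near chunk 42
hits = top_k_cosine(q, index, k=5)
```

At corpus scale this linear scan is what ANN avoids; the trade is recall for latency, which is exactly why the reranker in Stage 2 gets the final word.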

🎯

Stage 2 — Cross-Encoder (Reranker)

The cross-encoder sees the full (query, passage) pair together and computes a deeper relevance score. It's too slow to run over the entire corpus, but on just 10–20 candidates from Stage 1 it's extremely fast. In STEM contexts, this is critical: a passage about "fluid dynamics pressure" may look geometrically similar to "blood pressure dynamics" — the cross-encoder correctly distinguishes them.

📊

Why STEM Needs This

Semantic embedding similarity alone struggles with:

Challenge | Example | Reranker Fix
Symbol ambiguity | "σ" in stress vs. statistics | Context from surrounding text resolves the domain
Formula lookups | "Navier-Stokes derivation" | Cross-encoder scores equation-heavy passages higher
Units & notation | N/m² vs. Pa vs. kPa | Joint query-passage model handles unit equivalence
Near-duplicate chunks | Same theorem, different notation | Reranker picks the clearest, most complete version

🔑 Retrieval Parameters

  • Top-K (ANN): Configurable via config — typically 10–20 candidates
  • Top-N (after rerank): 3 chunks sent to LLM context window
  • Similarity metric: Cosine distance in ChromaDB
  • Metadata filters: Can scope by document, chapter, or content type
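Expressed as config, the parameters above might look like this; the key names are illustrative, and the project's actual config may differ:

```python
# Illustrative retrieval config — key names are assumptions, not the
# project's actual config schema.
RETRIEVAL = {
    "top_k": 10,           # ANN candidates fetched from ChromaDB (10–20 typical)
    "top_n": 3,            # chunks kept after cross-encoder rerank
    "distance": "cosine",  # ChromaDB collection metric
    "filters": {"content_type": "text"},  # optional metadata scoping
}
```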

Technology Stack

Carefully chosen for local-first, privacy-preserving, STEM-optimised operation.

Core Stack

ChromaDB
sentence-transformers
Cross-Encoder
Claude API
OpenRouter
Python 3.10+

Document & Utility

pdfplumber / PyMuPDF
python-docx
openpyxl
python-pptx
Pillow
numpy

📁 Project Structure