A fully local, production-grade system that transforms PDFs, DOCX, XLSX, PPTX, HTML, TXT, and CSV files into high-quality QA, Chain-of-Thought, DPO preference, and multi-hop training datasets — powered entirely by Ollama, no cloud APIs required.
Every source document passes through a deterministic, quality-controlled workflow before a single record enters the final dataset. Each stage has one responsibility.
Parse raw content from PDFs, Office files, HTML, TXT, and CSV into normalised extracted records.
Build 300–400 token semantic chunks with 50-token overlap, preserving formulas and table structures.
Generate question-answer pairs locally via Ollama with resilient 4-stage JSON recovery fallback.
Score each pair across 6 dimensions: groundedness, length, hallucination markers, truncation, structure, answer-fit.
Remove near-duplicate prompts using embedding cosine similarity at 0.85 threshold with exact-match fallback.
Add CoT reasoning traces, DPO preference pairs with controlled rejection, and cross-chunk multi-hop records.
Write six training-ready dataset files plus an HTML visual report and machine-readable JSON quality diagnostics.
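Of the stages above, the chunk-overlap arithmetic can be sketched as a sliding window. The real chunker packs sentence- and structure-aware segments rather than fixed windows, so take this as the overlap maths only:

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Slide a window of `size` tokens, stepping size-overlap each time,
    so consecutive chunks share `overlap` tokens of context."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end of the document
    return chunks
```

With a 1000-token document this yields three chunks, each no longer than 400 tokens, with the second chunk starting 350 tokens in so that 50 tokens overlap its predecessor.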
Six purpose-built subsystems, each with a single clear role — enabling independent testing, replacement, and optimisation across the pipeline.
Splits extracted text into semantically coherent segments rather than raw fixed windows. Preserves equation regions and table structure to prevent damage to technical content before generation begins.
Uses tiktoken for precise token accounting when available, falling back gracefully otherwise. Truncation respects sentence endings, LaTeX blocks, and table rows to prevent malformed training samples.
Detects locally available Ollama models and assigns them by role — QA generation, categorisation, visual extraction — keeping the system usable across heterogeneous local environments without configuration.
Recovers structured QA output from malformed LLM responses using a 4-stage fallback strategy: strict parse → lenient parse → regex extraction → schema reconstruction. Eliminates silent failures in long batch jobs.
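A minimal sketch of the four recovery stages, with illustrative field names (`question`/`answer`) and deliberately simple regexes standing in for the real parsers:

```python
import json
import re


def recover_qa(raw: str):
    """Recover a {'question', 'answer'} dict from a malformed LLM response.
    Stages: strict parse -> lenient cleanup -> regex extraction -> reconstruction."""
    # Stage 1: strict parse
    try:
        obj = json.loads(raw)
        if isinstance(obj, dict) and "question" in obj and "answer" in obj:
            return obj
    except json.JSONDecodeError:
        pass
    # Stage 2: lenient parse -- strip code fences and trailing commas
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.M)
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        obj = json.loads(cleaned)
        if isinstance(obj, dict) and "question" in obj and "answer" in obj:
            return obj
    except json.JSONDecodeError:
        pass
    # Stage 3: regex extraction of quoted fields from broken JSON
    q = re.search(r'"question"\s*:\s*"([^"]*)"', raw)
    a = re.search(r'"answer"\s*:\s*"([^"]*)"', raw)
    if q and a:
        return {"question": q.group(1), "answer": a.group(1)}
    # Stage 4: schema reconstruction from labelled plain text
    q = re.search(r"(?:Question|Q)\s*[:\-]\s*(.+)", raw)
    a = re.search(r"(?:Answer|A)\s*[:\-]\s*(.+)", raw)
    if q and a:
        return {"question": q.group(1).strip(), "answer": a.group(1).strip()}
    return None  # an explicit failure, never a silent one
```

Returning `None` rather than raising is what keeps a single broken response from aborting a long batch job.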
Pulls OCR-like content from scanned pages, figures, and slide imagery through a vision-capable local model, ensuring visual documents still contribute high-quality dataset records.
Tracks quality scores, domain distribution, difficulty levels, split counts, and length histograms after every run. Outputs both a visual HTML report and a machine-readable JSON diagnostic for CI integration.
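A toy aggregation in the shape of the JSON diagnostic; the actual payload is richer, and the field names below are illustrative:

```python
from collections import Counter


def run_diagnostics(records):
    """Aggregate per-record metadata into a machine-readable summary
    (score histogram bucketed to one decimal place)."""
    scores = [r["score"] for r in records]
    return {
        "count": len(records),
        "mean_score": sum(scores) / len(scores),
        "score_histogram": dict(Counter(round(s, 1) for s in scores)),
        "domains": dict(Counter(r["domain"] for r in records)),
        "difficulty": dict(Counter(r["difficulty"] for r in records)),
    }
```

A CI quality gate can then assert on `mean_score` or on bucket counts without parsing the HTML report.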
Three enrichment modes that transform a standard QA dataset into training material suitable for reasoning models, alignment, and multi-step inference.
Generates explicit step-by-step reasoning fields alongside final answers, making the dataset directly suitable for reasoning-model training and post-hoc analysis of model behaviour across domains.
Creates chosen and rejected answer pairs by injecting controlled failure modes — hallucination, incompleteness, wrong-formula substitution — producing alignment-ready data for RLHF and DPO training loops.
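The controlled-failure idea can be sketched as follows. The corruptions are crude stand-ins for the model-driven rejection generators, and the output fields mirror the TRL DPO convention:

```python
REJECTION_MODES = ("hallucination", "incompleteness", "wrong_formula")


def corrupt(answer: str, mode: str) -> str:
    """Inject one controlled failure mode into a known-good answer."""
    if mode == "incompleteness":
        return answer[: max(len(answer) // 3, 1)]  # truncate to a fragment
    if mode == "hallucination":
        # append an unsupported claim absent from the source chunk
        return answer + " This is further confirmed by Section 12 of the source."
    if mode == "wrong_formula":
        # degrade an exact relation into an approximate one
        return answer.replace("=", "\u2248") if "=" in answer else answer + " (roughly doubled)"
    raise ValueError(f"unknown rejection mode: {mode}")


def make_dpo_pair(question: str, chosen: str, mode: str) -> dict:
    return {"prompt": question, "chosen": chosen,
            "rejected": corrupt(chosen, mode), "rejection_type": mode}
```

Keeping the `rejection_type` label on each pair lets downstream analysis check which failure modes the aligned model actually learns to avoid.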
Uses embedding similarity to identify semantically related but non-adjacent chunks, then prompts the model to generate causal, comparative, and inferential questions that require synthesis across both source passages.
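A pure-Python sketch of the pairing step, assuming precomputed chunk embeddings (e.g. from bge-m3); the 0.9 threshold and adjacency gap used below are illustrative, not the pipeline's tuned values. The same cosine measure drives near-duplicate filtering at the 0.85 threshold:

```python
from math import sqrt


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def multi_hop_pairs(embeddings, min_sim=0.9, min_gap=2):
    """Return (i, j, similarity) for chunk pairs that are semantically
    related (cosine >= min_sim) but non-adjacent (at least min_gap apart)."""
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + min_gap, n):  # skip adjacent chunks
            s = cosine(embeddings[i], embeddings[j])
            if s >= min_sim:
                pairs.append((i, j, s))
    return pairs
```

Each surviving pair is then handed to the model with both source passages so the generated question genuinely requires synthesis across them.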
One pipeline run produces six training-ready dataset files and two quality report files — all in a single output directory.
Conversation-style records with roles and turns, compatible with Unsloth and standard TRL fine-tuning workflows. (Unsloth)
Instruction-input-output records for broad community fine-tuning toolchains and Hugging Face dataset loaders. (Instruction)
Reasoning and answer fields separated into distinct keys for training reasoning-capable language models. (CoT)
Chosen and rejected response pairs with rejection-type labels, ready for TRL DPO training loops. (Alignment)
Full metadata-rich schema including domains, difficulty scores, timestamps, source filenames, and provenance fields. (Schema)
Streaming-friendly newline-delimited variant of the commercial schema for large-scale data pipelines. (Large-scale)
Score histograms, domain distribution charts, difficulty breakdowns, and dataset health summary for human inspection. (Visual)
Structured dataset diagnostics for CI pipelines, audit trails, and automated quality gate enforcement. (Automation)
Every question-answer pair is evaluated across six dimensions before entering the final dataset. Low-signal or broken examples are filtered before deduplication runs.
Rewards answers anchored to the source chunk. Penalises unsupported claims or content absent from the provided context passage.
Filters trivial one-line outputs and oversized, low-density answers that degrade training signal quality.
Rejects uncertain hedges and refusal-style outputs indicating the model failed to generate a grounded answer.
Detects mid-sentence cuts, dangling endings, and partial generations caused by token-limit or parser failures.
Verifies the prompt is phrased as a usable question rather than a malformed statement or degenerate output.
Penalises copy-paste style outputs where the model repeats prompt text verbatim instead of generating a response.
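Toy versions of the six checks, in the order listed above; the heuristics and thresholds are deliberately simplified stand-ins for the real scorers:

```python
import re


def score_pair(question: str, answer: str, context: str) -> dict:
    ctx_tokens = set(context.lower().split())
    ans_tokens = answer.lower().split()
    return {
        # groundedness: share of answer tokens that appear in the source chunk
        "groundedness": sum(t in ctx_tokens for t in ans_tokens) / max(len(ans_tokens), 1),
        # length: neither trivial nor bloated
        "length": 1.0 if 10 <= len(ans_tokens) <= 300 else 0.0,
        # hallucination markers: hedges and refusal phrasing
        "hedging": 0.0 if re.search(r"\b(i cannot|not sure|as an ai)\b", answer.lower()) else 1.0,
        # truncation: answer should end at a sentence boundary
        "truncation": 1.0 if answer.rstrip().endswith((".", "!", "?")) else 0.0,
        # structure: prompt should read as an actual question
        "structure": 1.0 if question.rstrip().endswith("?") else 0.0,
        # answer-fit: answer should not simply echo the prompt
        "answer_fit": 0.0 if answer.strip().lower() == question.strip().lower() else 1.0,
    }
```

A pair only proceeds to deduplication if every dimension clears its configured floor.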
Each format has a dedicated extraction path that preserves document structure before chunking — not a generic text dump.
Text extraction via pdfplumber with PyMuPDF fallback. Scanned-page VLM routing and mixed-layout support for technical documents.
Structured reading of paragraphs, inline tables, and document sections via python-docx for report-style sources.
Sheet-wise conversion of tabular data into readable structured text via openpyxl for data-heavy scientific sources.
Slide text, speaker notes, and image-aware handling via python-pptx to support educational and research decks.
Clean tag stripping with readable text extraction for webpages, documentation sites, and generated HTML reports.
Lightweight direct ingestion with logical paragraph-block partitioning for notes, logs, and raw text corpora.
Header-aware row formatting that turns structured records into coherent textual training entries for downstream QA.
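The routing itself reduces to a suffix-to-handler table. The sketch below implements only the TXT and CSV paths; the other extractors plug into the same table, and all names here are illustrative:

```python
import csv
import io
from pathlib import Path


def extract_txt(text: str) -> list[str]:
    """Logical paragraph-block partitioning for plain text."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def extract_csv(text: str) -> list[str]:
    """Header-aware row formatting: one 'column: value' line per record."""
    rows = csv.DictReader(io.StringIO(text))
    return ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in rows]


HANDLERS = {".txt": extract_txt, ".csv": extract_csv}
# the full pipeline registers .pdf/.docx/.xlsx/.pptx/.html handlers the same way


def extract(path: str) -> list[str]:
    """Route a file to its format-specific extractor by suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in HANDLERS:
        raise ValueError(f"unsupported format: {suffix}")
    return HANDLERS[suffix](Path(path).read_text(encoding="utf-8"))
```

Keeping each extractor behind a common signature is what makes a single format path testable and replaceable in isolation.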
Calibrated defaults that reflect the practical operating behaviour of the pipeline across diverse document types and local model configurations.
Chosen for local-first, privacy-preserving operation. No cloud dependency. Designed to run in constrained environments including air-gapped systems.
Local model orchestration for QA generation, categorisation, and enrichment stages with role-based routing.
Semantic embeddings for chunk relationships, retrieval-style matching, and deduplication support.
Primary similarity engine for near-duplicate filtering and cross-chunk multi-hop pairing.
Lightweight embedding fallback for chunking and semantic segmentation when bge-m3 is unavailable.
Robust PDF text extraction with fallback handling for technical, mixed-layout, and scanned documents.
Format-specific extraction for Office documents preserving paragraph structure, tables, and slide notes.
Sheet-wise extraction for Excel files with header-aware row formatting for tabular scientific data.
Visual conversion and image processing for scanned pages routed to the VLM extraction path.
Precise token accounting for safe output shaping before generation and per-field length validation.
Cosine similarity computation, vector math, and statistical analytics across the pipeline.
Retry safety for unstable Ollama responses and resilience during long-running generation jobs.
Score histograms, distribution plots, and domain breakdowns embedded in the HTML quality report.
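The retry behaviour can be sketched as plain exponential backoff around any flaky call; the commented usage line assumes the `ollama` Python client and a hypothetical model name:

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure, back off exponentially (1s, 2s, 4s, ...) and retry.
    The final failure is re-raised so batch jobs can log and skip the record."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


# usage sketch (hypothetical model and prompt):
# answer = with_retries(lambda: ollama.chat(model="llama3", messages=[...]))
```

Injecting `sleep` as a parameter keeps the wrapper unit-testable without real delays.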
The pipeline is engineered around locality: source documents never leave the machine. There are no mandatory external API calls at any stage of the 7-step workflow — inference, embedding, validation, and export all run on-device.