LLM Fine-Tuning Infrastructure

Local document-to-dataset extraction pipeline

A fully local, production-grade system that transforms PDFs, DOCX, XLSX, PPTX, HTML, TXT, and CSV files into high-quality QA, Chain-of-Thought, DPO preference, and multi-hop training datasets — powered entirely by Ollama, no cloud APIs required.

Production Grade · Ollama Local Inference · CoT + DPO Ready · 7 Input Formats · Quality Controlled · Zero Cloud Dependency
Pipeline Overview
  • 📥 Multi-format ingestion: PDF, DOCX, XLSX, PPTX, HTML, TXT, and CSV with structure-aware, per-format extraction paths.
  • 🧠 Local model generation: Ollama-driven QA pairs, reasoning traces, DPO preference pairs, and cross-chunk multi-hop synthesis.
  • Quality-controlled output: 6-dimension scoring, embedding-based deduplication, and full HTML + JSON reporting per run.
  • 📦 8 export files per run: ShareGPT, Alpaca, CoT, DPO, and commercial JSON/JSONL datasets plus HTML and JSON quality reports, for Unsloth, TRL, and HF workflows.

7+ Input Formats · 8 Export Files per Run · 6 Quality Dimensions · 0.85 Dedup Similarity Threshold

Processing Stages

7-Stage Pipeline

Every source document passes through a deterministic, quality-controlled workflow before a single record enters the final dataset. Each stage has one responsibility.

Document-to-dataset workflow · 7 deterministic stages
01 📂 Ingestion: Parse raw content from PDFs, Office files, HTML, TXT, and CSV into normalised extracted records.

02 ✂️ Chunking: Build 300–400 token semantic chunks with 50-token overlap, preserving formulas and table structures.
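
The windowing arithmetic behind these chunks can be sketched in a few lines (a simplified, hypothetical `chunk_tokens`; the real SemanticChunker additionally respects sentence, formula, and table boundaries):

```python
def chunk_tokens(tokens, max_len=400, overlap=50):
    """Slide a fixed window over a token stream, keeping `overlap`
    tokens of shared context between adjacent chunks."""
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_len, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break
        start = end - overlap  # step back so context carries forward
    return chunks
```

The overlap means the last 50 tokens of one chunk reappear at the head of the next, so answers that straddle a chunk boundary stay answerable.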

03 🤖 QA Generation: Generate question-answer pairs locally via Ollama with a resilient 4-stage JSON recovery fallback.
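
A request to the local Ollama server for this stage might be assembled roughly like this (model name and prompt wording are illustrative, not the pipeline's actual template; Ollama's `/api/generate` endpoint does accept a `"format": "json"` hint):

```python
def build_qa_request(chunk_text, model="llama3.1:8b", n_pairs=3):
    """Assemble a payload for Ollama's local /api/generate endpoint,
    asking for QA pairs as strict JSON (illustrative sketch)."""
    prompt = (
        f"Generate {n_pairs} question-answer pairs grounded ONLY in the "
        'passage below. Reply with a JSON list of objects with '
        '"question" and "answer" keys.\n\n' + chunk_text
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}

# e.g. requests.post("http://localhost:11434/api/generate",
#                    json=build_qa_request(chunk))
```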

04 ⚖️ Validation: Score each pair across 6 dimensions (groundedness, length, hallucination markers, truncation, structure, answer-fit).

05 🔍 Deduplication: Remove near-duplicate prompts using embedding cosine similarity at a 0.85 threshold, with an exact-match fallback.
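
The dedup decision itself reduces to a cosine comparison against already-kept records, sketched here with plain-Python vectors (in the pipeline the embeddings come from a locally loaded sentence-transformers model; `dedupe` is a hypothetical name):

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe(records, threshold=0.85):
    """Greedy near-duplicate filter: keep a record only if it stays
    below the cosine threshold against everything already kept."""
    kept = []
    for text, vec in records:
        if all(cosine(vec, kv) < threshold for _, kv in kept):
            kept.append((text, vec))
    return [t for t, _ in kept]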

06 🔗 Enrichment: Add CoT reasoning traces, DPO preference pairs with controlled rejection, and cross-chunk multi-hop records.

07 📤 Export & Report: Write six training-ready dataset files plus an HTML visual report and machine-readable JSON quality diagnostics.

Core Subsystems

Backend Components

Six purpose-built subsystems, each with a single clear role — enabling independent testing, replacement, and optimisation across the pipeline.

chunker.py · SemanticChunker
Splits extracted text into semantically coherent segments rather than raw fixed windows. Preserves equation regions and table structure to prevent damage to technical content before generation begins.
300–400 tokens · 50 overlap · Formula-safe

tokens.py · TokenCounter
Uses tiktoken for precise token accounting when available, falling back gracefully otherwise. Truncation respects sentence endings, LaTeX blocks, and table rows to prevent malformed training samples.
tiktoken · Graceful fallback · Safe truncation

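
The try-tiktoken-else-fallback pattern can be sketched as follows (the `cl100k_base` encoding choice and the whitespace heuristic are assumptions, not the module's confirmed internals):

```python
def count_tokens(text):
    """Count tokens with tiktoken when usable; otherwise fall back
    to a rough whitespace split, mirroring the graceful degradation
    described above."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except Exception:  # tiktoken missing, or its BPE data unavailable
        return len(text.split())
```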
registry.py · ModelRegistry
Detects locally available Ollama models and assigns them by role (QA generation, categorisation, visual extraction), keeping the system usable across heterogeneous local environments without configuration.
Ollama · Zero-config · Role routing

parser.py · Robust JSON Parser
Recovers structured QA output from malformed LLM responses using a 4-stage fallback strategy: strict parse → lenient parse → regex extraction → schema reconstruction. Eliminates silent failures in long batch jobs.
4-stage fallback · Schema recovery · Fault-tolerant

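
A minimal sketch of the four stages (the real parser.py handles more malformations, but the strict → lenient → regex → reconstruct ordering is the idea; `parse_qa` is a hypothetical name):

```python
import json
import re

def parse_qa(raw):
    """Recover structured output from a possibly malformed LLM reply."""
    # 1. Strict parse.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Lenient parse: strip code fences and trailing commas.
    cleaned = re.sub(r"```(?:json)?|```", "", raw)
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass
    # 3. Regex extraction: pull the first {...} or [...] region.
    m = re.search(r"\[.*\]|\{.*\}", cleaned, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    # 4. Schema reconstruction fallback: never fail silently.
    return {"question": None, "answer": None, "parse_failed": True}
```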
vision.py · Visual Extractor (VLM)
Pulls OCR-like content from scanned pages, figures, and slide imagery through a vision-capable local model, ensuring visual documents still contribute high-quality dataset records.
qwen-vl · Scanned docs · Image text

stats.py · DatasetStatistics
Tracks quality scores, domain distribution, difficulty levels, split counts, and length histograms after every run. Outputs both a visual HTML report and a machine-readable JSON diagnostic for CI integration.
HTML report · JSON report · Per-run analytics

Advanced Enrichment

Beyond Basic QA

Three enrichment modes that transform a standard QA dataset into training material suitable for reasoning models, alignment, and multi-step inference.

Chain of Thought · Reasoning Traces
Generates explicit step-by-step reasoning fields alongside final answers, making the dataset directly suitable for reasoning-model training and post-hoc analysis of model behaviour across domains.
Output: dataset_hf_cot.jsonl

DPO Training · Preference Pairs
Creates chosen and rejected answer pairs by injecting controlled failure modes (hallucination, incompleteness, wrong-formula substitution), producing alignment-ready data for RLHF and DPO training loops.
Output: dataset_hf_dpo.jsonl

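
A sketch of controlled rejection (the two modes shown here stand in for the pipeline's richer hallucination / incompleteness / wrong-formula modes; `make_dpo_record` is a hypothetical name):

```python
def make_dpo_record(question, answer, mode="truncate"):
    """Degrade a grounded answer with a controlled failure mode to
    form the rejected side of a preference pair."""
    if mode == "truncate":
        rejected = answer[: max(1, len(answer) // 2)]           # incompleteness
    elif mode == "hedge":
        rejected = "I am not certain, but possibly: " + answer  # hallucination-style hedging
    else:
        raise ValueError(f"unknown rejection mode: {mode}")
    return {"prompt": question, "chosen": answer,
            "rejected": rejected, "rejection_type": mode}
```

Labelling each record with its `rejection_type` lets downstream analysis check which failure modes the trained model learns to avoid.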
Cross-Chunk QA · Multi-Hop Reasoning
Uses embedding similarity to identify semantically related but non-adjacent chunks, then prompts the model to generate causal, comparative, and inferential questions that require synthesis across both source passages.
multi-hop · causal · comparative

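
Pair selection can be sketched as a similarity-band filter over chunk embeddings (the `low` band edge and `min_gap` values are illustrative assumptions; only the 0.85 near-duplicate ceiling appears in the pipeline's documented defaults):

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def multi_hop_pairs(embeddings, low=0.55, high=0.85, min_gap=2):
    """Select chunk pairs that are related (similarity >= low) but
    neither near-duplicates (< high) nor adjacent in the document
    (index gap >= min_gap)."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + min_gap, len(embeddings)):
            if low <= cosine(embeddings[i], embeddings[j]) < high:
                pairs.append((i, j))
    return pairs
```

Each selected pair is then fed to the model as two context passages with a prompt asking for a question that cannot be answered from either passage alone.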
Export Schema

Output Formats

One pipeline run produces six training-ready dataset files and two quality report files — all in a single output directory.

dataset_hf_sharegpt.jsonl · ShareGPT Format
Conversation-style records with roles and turns, compatible with Unsloth and standard TRL fine-tuning workflows.

dataset_hf_alpaca.jsonl · Alpaca Format
Instruction-input-output records for broad community fine-tuning toolchains and Hugging Face dataset loaders.
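
The two record shapes differ only in structure, roughly as follows (minimal sketches; real exports may carry extra turns and metadata). Each record is written as one `json.dumps` line in the `.jsonl` file:

```python
def to_sharegpt(question, answer):
    """Minimal two-turn ShareGPT-style record."""
    return {"conversations": [{"from": "human", "value": question},
                              {"from": "gpt", "value": answer}]}

def to_alpaca(question, answer, context=""):
    """Minimal Alpaca-style instruction record."""
    return {"instruction": question, "input": context, "output": answer}
```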
dataset_hf_cot.jsonl · Chain-of-Thought
Reasoning and answer fields separated into distinct keys for training reasoning-capable language models.

dataset_hf_dpo.jsonl · DPO Preference Pairs
Chosen and rejected response pairs with rejection-type labels, ready for TRL DPO training loops.

dataset_commercial.json · Commercial JSON
Full metadata-rich schema including domains, difficulty scores, timestamps, source filenames, and provenance fields.

dataset_commercial.jsonl · Commercial JSONL
Streaming-friendly newline-delimited variant of the commercial schema for large-scale data pipelines.

quality_report.html · Visual Quality Report
Score histograms, domain distribution charts, difficulty breakdowns, and a dataset health summary for human inspection.

quality_report.json · Machine-Readable Report
Structured dataset diagnostics for CI pipelines, audit trails, and automated quality-gate enforcement.

Scoring System

Quality Scoring

Every question-answer pair is evaluated across six dimensions before entering the final dataset. Low-signal or broken examples are filtered before deduplication runs.

  • 25 pts · Groundedness: Rewards answers anchored to the source chunk. Penalises unsupported claims or content absent from the provided context passage.
  • 20 pts · Length Control: Filters trivial one-line outputs and oversized, low-density answers that degrade training signal quality.
  • 20 pts · Hallucination Markers: Rejects uncertain hedges and refusal-style outputs indicating the model failed to generate a grounded answer.
  • 15 pts · No Truncation: Detects mid-sentence cuts, dangling endings, and partial generations caused by token-limit or parser failures.
  • 10 pts · Question Structure: Verifies the prompt is phrased as a usable question rather than a malformed statement or degenerate output.
  • 10 pts · Answer Fit: Penalises copy-paste style outputs where the model repeats prompt text verbatim instead of generating a response.

Threshold Logic

  • Score ≥ 60: Record passes quality control and advances to the deduplication stage.
  • Score 40–59: Record is discarded and counted as a low-quality removal in the run report.
  • Score < 40: Record is dropped immediately and logged as a hard failure with full diagnostics.
  • Dedup pass: Passing records may still be removed if cosine similarity to an existing record exceeds 0.85.

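
Putting the rubric and thresholds together, the routing logic looks roughly like this (the dimension keys and the fraction-in-[0, 1] representation are assumptions; the point weights and cut-offs are the documented ones):

```python
# Maximum points per dimension, from the rubric above (sums to 100).
WEIGHTS = {"groundedness": 25, "length": 20, "hallucination": 20,
           "truncation": 15, "structure": 10, "answer_fit": 10}

def route(scores):
    """Route a QA pair given per-dimension fractions in [0, 1]."""
    total = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    if total >= 60:
        return "pass"          # advances to deduplication
    if total >= 40:
        return "low_quality"   # discarded, counted in the run report
    return "hard_failure"      # dropped immediately with diagnostics
```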
Document Ingestion

Supported Input Formats

Each format has a dedicated extraction path that preserves document structure before chunking — not a generic text dump.

  • .pdf: Text extraction via pdfplumber with PyMuPDF fallback. Scanned-page VLM routing and mixed-layout support for technical documents.
  • .docx: Structured reading of paragraphs, inline tables, and document sections via python-docx for report-style sources.
  • .xlsx: Sheet-wise conversion of tabular data into readable structured text via openpyxl for data-heavy scientific sources.
  • .pptx: Slide text, speaker notes, and image-aware handling via python-pptx to support educational and research decks.
  • .html: Clean tag stripping with readable text extraction for webpages, documentation sites, and generated HTML reports.
  • .txt: Lightweight direct ingestion with logical paragraph-block partitioning for notes, logs, and raw text corpora.
  • .csv: Header-aware row formatting that turns structured records into coherent textual training entries for downstream QA.
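
Header-aware formatting of CSV rows can be sketched with the standard library (a hypothetical helper, not the pipeline's actual code):

```python
import csv
import io

def csv_rows_to_text(csv_text):
    """Render each data row as 'header: value' pairs so tabular
    records become readable text entries for chunking and QA."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in reader]
```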

Operational Parameters

Key Metrics

Calibrated defaults that reflect the practical operating behaviour of the pipeline across diverse document types and local model configurations.

  • Chunk Size: 300–400 · semantic token window per segment
  • Token Overlap: 50 · context retained between adjacent chunks
  • Dedup Threshold: 0.85 · cosine similarity cut-off for near-duplicates
  • Pass Score: 60 · minimum quality score for dataset inclusion
Infrastructure

Technology Stack

Chosen for local-first, privacy-preserving operation. No cloud dependency. Designed to run in constrained environments including air-gapped systems.

Inference · Core AI Stack
  • Ollama API: Local model orchestration for QA generation, categorisation, and enrichment stages with role-based routing.
  • sentence-transformers: Semantic embeddings for chunk relationships, retrieval-style matching, and deduplication support.
  • BAAI/bge-m3: Primary similarity engine for near-duplicate filtering and cross-chunk multi-hop pairing.
  • all-MiniLM-L6-v2: Lightweight embedding fallback for chunking and semantic segmentation when bge-m3 is unavailable.

Extraction · Document Processing
  • pdfplumber + PyMuPDF: Robust PDF text extraction with fallback handling for technical, mixed-layout, and scanned documents.
  • python-docx + python-pptx: Format-specific extraction for Office documents preserving paragraph structure, tables, and slide notes.
  • openpyxl: Sheet-wise extraction for Excel files with header-aware row formatting for tabular scientific data.
  • pdf2image + Pillow: Visual conversion and image processing for scanned pages routed to the VLM extraction path.

Validation · Quality & Reporting
  • tiktoken: Precise token accounting for safe output shaping before generation and per-field length validation.
  • scikit-learn + numpy: Cosine similarity computation, vector math, and statistical analytics across the pipeline.
  • tenacity: Retry safety for unstable Ollama responses and resilience during long-running generation jobs.
  • matplotlib: Score histograms, distribution plots, and domain breakdowns embedded in the HTML quality report.

Privacy-First Local Workflow

The pipeline is engineered around locality: source documents never leave the machine. There are no mandatory external API calls at any stage of the 7-step workflow — inference, embedding, validation, and export all run on-device.

  • No cloud inference dependency: All generation runs through local Ollama model serving.
  • No external embedding calls: Similarity and deduplication use locally loaded sentence-transformers.
  • No mandatory API keys: The workflow is self-contained and reproducible without credentials.
  • Suitable for sensitive corpora: Internal manuals, research drafts, and proprietary technical material stay on-device.