Production-Grade · Local Ollama Inference · CoT + DPO Ready · Multi-Format Ingestion

Production Dataset Extraction
Pipeline for LLM Fine-Tuning

A fully local document-to-dataset system that transforms PDFs, DOCX, XLSX, PPTX, HTML, TXT, and CSV files into high-quality QA, Chain-of-Thought, DPO, and multi-hop training data for Unsloth, TRL, and Hugging Face workflows.

  • 7 input formats supported
  • 8 export files per run
  • 6 quality checks scored
  • 0.85 dedup similarity threshold
  • Document ingestion: PDF, DOCX, XLSX, PPTX, HTML, TXT, and CSV extraction with structure-aware processing.
  • Local generation: Ollama-driven QA, reasoning traces, DPO pairs, and multi-hop data creation without cloud APIs.
  • Quality controlled: scoring, deduplication, schema-aware outputs, and reporting for reliable fine-tuning datasets.

Pipeline Architecture

Every source document is pushed through a deterministic, quality-controlled workflow before it becomes a final dataset record ready for post-training.

7-stage document-to-dataset workflow
01. Ingestion: Parse raw content from PDFs, Office files, HTML, TXT, and CSV into extracted records.

02. Chunking: Build 300-400 token semantic chunks with overlap while preserving formulas and tables.

03. QA Generation: Generate question-answer pairs locally using Ollama with resilient JSON recovery.

04. Validation: Score each pair across groundedness, length, hallucination, truncation, question-structure, and answer-fit checks.

05. Deduplication: Remove near-duplicate prompts using embedding similarity or exact fallback matching.

06. Enrichment: Add CoT traces, DPO preference pairs, and cross-chunk reasoning examples.

07. Export & Report: Write training-ready dataset files plus human-readable and machine-readable quality reports.
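As a rough illustration, the gating logic of stages 03-05 can be sketched with stand-in callables; `generate` and `score` here are hypothetical placeholders, not the project's real API:

```python
def run_pipeline(chunks, generate, score, min_score=60):
    """Toy end-to-end flow: generate one QA pair per chunk, gate by
    quality score, then drop exact-duplicate questions (the exact-match
    fallback used when embeddings are unavailable)."""
    pairs = [generate(chunk) for chunk in chunks]          # 03: QA generation
    passed = [p for p in pairs if score(p) >= min_score]   # 04: validation
    seen, unique = set(), []                               # 05: deduplication
    for question, answer in passed:
        if question not in seen:
            seen.add(question)
            unique.append((question, answer))
    return unique
```

The real stages add enrichment and reporting on top, but the shape is the same: each stage narrows or augments the record stream before export.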

Core Components

These are the main subsystems that make the pipeline useful in real training workflows, not just as a demo.

SemanticChunker

Splits extracted text into semantically coherent chunks instead of raw fixed windows. Preserves equation regions and table structure to avoid damaging technical content before generation.

300-400 tokens · 50-token overlap · formula safe
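A minimal chunking sketch, assuming whitespace tokens as a stand-in for real token counts and sentence boundaries as the semantic breaks; the actual chunker additionally protects formulas and tables:

```python
import re

def semantic_chunks(text, min_tokens=300, overlap=50):
    """Accumulate sentences until the token budget is met, then carry
    `overlap` tokens into the next chunk so context is shared."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        current.extend(sentence.split())
        if len(current) >= min_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]   # overlap carried forward
    if current:
        chunks.append(" ".join(current))   # flush the tail
    return chunks
```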

TokenCounter

Uses precise token counting when available and falls back gracefully when not. Truncation respects sentence endings, LaTeX blocks, and table rows to prevent malformed training samples.

tiktoken · fallback mode · safe truncation
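A hedged sketch of the count-then-fallback idea; the 1.3 tokens-per-word factor and the sentence-boundary truncation rule are illustrative assumptions, not the project's exact heuristics:

```python
def count_tokens(text):
    """Exact count via tiktoken when installed; otherwise a rough
    whitespace heuristic (the 1.3 factor is an assumption)."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("gpt2").encode(text))
    except ImportError:
        return max(1, int(len(text.split()) * 1.3))

def truncate_at_sentence(text, limit):
    """Cut at the last sentence boundary inside `limit` characters so a
    truncated sample never ends mid-sentence."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    end = max(cut.rfind("."), cut.rfind("!"), cut.rfind("?"))
    return cut[:end + 1] if end != -1 else cut
```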

ModelRegistry

Detects locally available Ollama models and assigns them by role, such as QA generation, categorization, and visual extraction, so the system stays usable across different local environments.

Ollama · zero config · role routing
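Ollama lists locally installed models via `GET /api/tags`; once that list is fetched, role assignment can be sketched as below. The preference table is hypothetical, not the project's real routing config:

```python
# Hypothetical role-to-model preference table.
PREFERENCES = {
    "qa": ["llama3.1", "mistral"],
    "categorize": ["llama3.2", "llama3.1"],
    "visual": ["qwen2.5vl", "llava"],
}

def assign_roles(available, preferences):
    """Map each role to the first preferred model found locally, so the
    pipeline degrades gracefully across different environments."""
    roles = {}
    for role, candidates in preferences.items():
        for prefix in candidates:
            match = next((m for m in available if m.startswith(prefix)), None)
            if match:
                roles[role] = match
                break
    return roles
```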

Ultra-Robust JSON Parser

Recovers structured QA output from messy LLM responses using multiple fallback strategies. This reduces failed generations and makes the pipeline resilient during long batch jobs.

4-stage fallback · schema recovery · fault tolerant
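One way to implement staged JSON recovery with the standard library; these particular stages illustrate the idea, and the parser's real four stages may differ:

```python
import json
import re

def parse_qa_json(raw):
    """Recover a JSON object from a messy LLM response."""
    try:
        return json.loads(raw)                      # stage 1: direct parse
    except json.JSONDecodeError:
        pass
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:                                      # stage 2: markdown fences
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    braced = re.search(r"\{.*\}", raw, re.DOTALL)
    if braced:                                      # stage 3: outermost braces
        try:
            return json.loads(braced.group(0))
        except json.JSONDecodeError:
            pass
    return None                                     # stage 4: skippable sentinel
```

Returning `None` rather than raising lets a long batch job skip one bad generation and keep going.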

Visual Extractor VLM

Pulls OCR-like content from scanned pages, figures, and slide imagery through a vision-capable model so visual documents still contribute useful dataset records.

qwen-vl · image text · scanned docs

DatasetStatistics

Tracks quality scores, domains, difficulty, split counts, question length, answer length, and record distributions so you can inspect dataset health after every run.

HTML report · JSON report · analytics
Chain of Thought

Reasoning Traces

Generates explicit step-by-step reasoning fields alongside final answers, making the dataset suitable for reasoning-model training and analysis workflows.

dataset_hf_cot.jsonl · reasoning + answer
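A possible shape for one CoT record; the field names are assumptions, and the exported schema may differ:

```python
import json

# Hypothetical field names for a reasoning-trace record.
record = {
    "question": "What similarity threshold removes near-duplicate prompts?",
    "reasoning": "Deduplication compares prompt embeddings and drops pairs "
                 "whose cosine similarity reaches the 0.85 cut-off.",
    "answer": "0.85",
}
line = json.dumps(record)  # one object per line in dataset_hf_cot.jsonl
```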
DPO Training

Preference Pairs

Creates chosen and rejected answers by injecting controlled failure modes such as hallucination, incompleteness, or wrong-formula behavior for alignment-style training.

dataset_hf_dpo.jsonl · TRL ready
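A possible shape for one preference pair, following the common TRL `chosen`/`rejected` convention; the `rejection_type` label is an assumption about this pipeline's schema:

```python
import json

# `rejected` carries a deliberately injected failure mode.
record = {
    "prompt": "How large are the semantic chunks?",
    "chosen": "Chunks are 300-400 tokens with a 50-token overlap.",
    "rejected": "Chunks are always exactly 1000 tokens.",  # wrong-fact injection
    "rejection_type": "hallucination",
}
line = json.dumps(record)  # one pair per line in dataset_hf_dpo.jsonl
```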
Cross-Chunk QA

Multi-Hop Reasoning

Uses embedding similarity to connect related chunks, then asks the model to create questions that require synthesis across both pieces of source material.

causal · comparative · inferential
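The pairing step can be approximated with plain cosine similarity, selecting chunk pairs that are related enough to share a topic but below the dedup threshold; the lower bound here is an illustrative assumption:

```python
def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def related_pairs(embeddings, low=0.4, high=0.85):
    """Chunk pairs in the 'related but not duplicate' similarity band."""
    return [
        (i, j)
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
        if low <= cosine(embeddings[i], embeddings[j]) < high
    ]
```

Each selected pair is then handed to the model with a prompt asking for a question that requires both chunks to answer.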

Output Formats

One run produces dataset files for training plus reporting files for quality inspection and automation pipelines.

  • dataset_hf_sharegpt.jsonl (ShareGPT format): Conversation-style records compatible with Unsloth and common TRL workflows.
  • dataset_hf_alpaca.jsonl (Alpaca format): Instruction-input-output records for broad community fine-tuning tooling.
  • dataset_hf_cot.jsonl (CoT format): Reasoning and answer fields separated for reasoning-model training.
  • dataset_hf_dpo.jsonl (DPO pairs): Chosen and rejected response pairs with rejection labels for alignment training.
  • dataset_commercial.json (commercial JSON): Full metadata-rich schema with domains, difficulty, timestamps, and provenance fields.
  • dataset_commercial.jsonl (commercial JSONL): Streaming-friendly newline-delimited form of the same rich schema.
  • quality_report.html (HTML report): Visual inspection report with score histograms, distribution summaries, and dataset stats.
  • quality_report.json (JSON report): Machine-readable diagnostics for CI, automation, and audit workflows.

Quality Scoring

Every question-answer pair is evaluated before it enters the final dataset so low-signal or broken examples do not pollute training.

  • Groundedness (25 pts): Rewards answers that stay anchored to the source chunk instead of inventing unsupported claims.
  • Length Control (20 pts): Filters out trivial one-line outputs and oversized, low-density answers that reduce training quality.
  • No Hallucination Markers (20 pts): Rejects uncertain or refusal-style outputs that indicate the model failed to answer from the source.
  • No Truncation (15 pts): Catches mid-sentence cuts, dangling endings, and partial generations caused by token or parser limits.
  • Question Structure (10 pts): Ensures the prompt is actually phrased like a usable question rather than a malformed statement.
  • Answer Fit (10 pts): Penalizes copy-paste style outputs where the model repeats the prompt instead of answering it.

Threshold Logic

  • Score ≥ 60: the record passes quality control and continues to deduplication.
  • Score 40-59: the record is discarded and counted as a low-quality removal.
  • Score < 40: the record is dropped immediately and logged as a hard failure.
  • Dedup pass: even passing records can still be removed if they are near-duplicates.
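The routing above reduces to a small function; the disposition labels here are illustrative:

```python
def route(score, is_duplicate=False):
    """Map a 0-100 quality score to a disposition label."""
    if score < 40:
        return "hard_fail"      # dropped immediately and logged
    if score < 60:
        return "low_quality"    # discarded, counted separately
    return "duplicate" if is_duplicate else "pass"
```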

Supported Input Formats

Each format has its own extraction path so the pipeline can preserve more structure before generation begins.

  • PDF (.pdf): Text extraction, scanned-page handling, and visual content support for mixed technical documents.
  • DOCX (.docx): Structured reading of paragraphs, tables, and inline document content for report-style sources.
  • XLSX (.xlsx): Sheet-wise conversion of tabular data into readable structured text for downstream generation.
  • PPTX (.pptx): Slide text, notes, and image-aware handling to support deck-style educational and business content.
  • HTML (.html): Clean tag stripping and readable text extraction for webpages, docs, and generated documentation.
  • TXT (.txt): Lightweight direct ingestion with logical block partitioning for notes, logs, and raw corpora.
  • CSV (.csv): Header-aware row formatting that turns structured records into useful textual training entries.
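For CSV, header-aware formatting might look like this sketch, which renders each data row as `header: value` pairs; the exact output layout is an assumption:

```python
import csv
import io

def csv_to_text(raw):
    """Turn a CSV string into one readable line per data row,
    pairing each value with its column header."""
    rows = list(csv.reader(io.StringIO(raw)))
    header, body = rows[0], rows[1:]
    return "\n".join(
        "; ".join(f"{h}: {v}" for h, v in zip(header, row)) for row in body
    )
```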

Operational Metrics

These values summarize the pipeline's practical operating behavior at a glance.

  • Chunk size: 300-400 semantic tokens per segment.
  • Token overlap: 50 tokens of context retained between chunks.
  • Dedup threshold: 0.85 cosine similarity cut-off.
  • Pass score: 60, the minimum QA quality requirement.
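Tying the dedup threshold to code, a toy first-wins dedup pass; `embed` is any text-to-vector callable (e.g. a sentence-transformers model), supplied by the caller:

```python
def dedup(prompts, embed, threshold=0.85):
    """Keep a prompt only if its embedding stays below `threshold`
    cosine similarity against every prompt already kept."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    kept, vectors = [], []
    for prompt in prompts:
        vec = embed(prompt)
        if all(cosine(vec, v) < threshold for v in vectors):
            kept.append(prompt)
            vectors.append(vec)
    return kept
```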

Technology Stack

The stack is organized around the pipeline's actual architecture and local-first workflow.

Core AI Stack

Ollama API

Local model orchestration for QA generation, categorization, and enrichment stages.

sentence-transformers

Semantic embeddings for chunk relationships, retrieval-style matching, and dedup support.

BAAI/bge-m3

Similarity engine for duplicate filtering and cross-chunk multi-hop pairing.

all-MiniLM-L6-v2

Lightweight embedding fallback for chunking and semantic segmentation workflows.

Document Processing

pdfplumber + PyMuPDF

Robust PDF text extraction and fallback handling for technical and mixed-layout documents.

python-docx + openpyxl + python-pptx

Format-specific extraction for Office files while preserving document structure.

pdf2image + Pillow

Visual conversion and image processing for scanned pages and slide-based assets.

Custom HTML parsing

Readable text extraction for web content without leaving heavy markup in the pipeline.

Validation & Reporting

tiktoken

Precise token accounting and safer output shaping before generation and export.

scikit-learn + numpy

Vector math, similarity logic, and general analytics support across the pipeline.

tenacity

Retry safety for unstable model responses and long-running generation jobs.

matplotlib

Visual quality reporting for distribution plots, score histograms, and summary exports.

Privacy-First Local Workflow

What makes this project special: it is not just another AI wrapper. It is built to run locally, keep source material on-device, and remain usable even in constrained environments.

  • No cloud dependency: inference is designed around local Ollama execution.
  • No external embeddings required: similarity and enrichment are intended for local processing.
  • No mandatory API keys: the workflow is built to be self-contained and reproducible.
  • Strong fit for sensitive documents: useful for internal manuals, research notes, and proprietary material.