A fully local document-to-dataset system that transforms PDFs, DOCX, XLSX, PPTX, HTML, TXT, and CSV files into high-quality QA, Chain-of-Thought, DPO, and multi-hop training data for Unsloth, TRL, and Hugging Face workflows.
Every source document is pushed through a deterministic, quality-controlled workflow before it becomes a final dataset record ready for post-training.
Parse raw content from PDFs, Office files, HTML, TXT, and CSV into extracted records.
Build 300-400 token semantic chunks with overlap while preserving formulas and tables.
Generate question-answer pairs locally using Ollama with resilient JSON recovery.
Score each pair across groundedness, structure, truncation, hallucination, and answer-fit checks.
Remove near-duplicate prompts using embedding similarity or exact fallback matching.
Add CoT traces, DPO preference pairs, and cross-chunk reasoning examples.
Write training-ready dataset files plus human-readable and machine-readable quality reports.
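The near-duplicate removal step above can be sketched roughly as follows. This is an illustrative sketch of the exact-fallback path only: `dedup_pairs`, the field name `question`, and the 0.92 threshold are assumptions for the example, with `difflib.SequenceMatcher` standing in for embedding cosine similarity.

```python
from difflib import SequenceMatcher

def dedup_pairs(pairs, threshold=0.92):
    """Greedy near-duplicate filter over QA prompts.

    Keeps the first occurrence of each prompt and drops later
    prompts whose string similarity to any kept prompt exceeds
    `threshold`. SequenceMatcher stands in here for embedding
    cosine similarity (the exact/string fallback path).
    """
    kept = []
    for pair in pairs:
        prompt = pair["question"].strip().lower()
        if any(SequenceMatcher(None, prompt, k).ratio() >= threshold
               for k in kept):
            continue  # near-duplicate of an earlier prompt: skip it
        kept.append(prompt)
        yield pair

pairs = [
    {"question": "What is gradient descent?", "answer": "..."},
    {"question": "What is gradient descent ?", "answer": "..."},
    {"question": "Define overfitting.", "answer": "..."},
]
unique = list(dedup_pairs(pairs))  # the second prompt is dropped
```

A real embedding-based pass would replace the `SequenceMatcher` call with a cosine similarity over cached prompt embeddings, keeping the same greedy keep-first structure.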
The pipeline is organized around the main subsystems that make it useful in real training workflows, not just as a demo.
Splits extracted text into semantically coherent chunks instead of raw fixed windows. Preserves equation regions and table structure to avoid damaging technical content before generation.
Uses precise token counting when available and falls back gracefully when not. Truncation respects sentence endings, LaTeX blocks, and table rows to prevent malformed training samples.
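The "precise when available, graceful when not" token counting can be sketched like this. The helper names and the characters-per-token fallback ratio are assumptions for illustration; the project's actual counter may differ.

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when it is installed, otherwise
    fall back to a rough estimate.

    The ~4 characters-per-token figure is a common approximation
    for English text, used only when no tokenizer is available.
    """
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        return max(1, len(text) // 4)

def fits_chunk(text: str, lo: int = 300, hi: int = 400) -> bool:
    """Check a candidate chunk against the 300-400 token target."""
    return lo <= count_tokens(text) <= hi
```

Because the fallback is only an estimate, chunk boundaries computed with it should leave headroom before hard model context limits.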
Detects locally available Ollama models and assigns them by role, such as QA generation, categorization, and visual extraction, so the system stays usable across different local environments.
Recovers structured QA output from messy LLM responses using multiple fallback strategies. This reduces failed generations and makes the pipeline resilient during long batch jobs.
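A multi-stage recovery chain of this kind can be sketched as below; the three strategies and their ordering are illustrative, not the project's exact implementation.

```python
import json
import re

def recover_json(raw: str):
    """Try progressively looser strategies to pull JSON out of an
    LLM response: direct parse, fenced-code extraction, then the
    first balanced {...} span found by brace counting."""
    # 1. Happy path: the response is already valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. JSON wrapped in a ```json ... ``` code fence.
    fence = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fence:
        try:
            return json.loads(fence.group(1))
        except json.JSONDecodeError:
            pass
    # 3. First balanced {...} span anywhere in the text (naive
    # brace counting; a sketch, not robust to braces in strings).
    start = raw.find("{")
    if start != -1:
        depth = 0
        for i, ch in enumerate(raw[start:], start):
            depth += ch == "{"
            depth -= ch == "}"
            if depth == 0:
                try:
                    return json.loads(raw[start:i + 1])
                except json.JSONDecodeError:
                    break
    return None  # caller treats this as a failed generation

fenced = recover_json('Sure! ```json\n{"q": "a"}\n``` hope that helps')
embedded = recover_json('The result: {"score": 4} done')
```

Returning `None` instead of raising keeps a single malformed response from aborting a long batch job; the caller can retry or skip.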
Pulls OCR-like content from scanned pages, figures, and slide imagery through a vision-capable model so visual documents still contribute useful dataset records.
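A request to a vision-capable Ollama model can be assembled as below. The model name `llava`, the prompt text, and the helper name are assumptions for the example; any locally pulled vision model works.

```python
import base64
from pathlib import Path

def build_vision_request(image_path: str, model: str = "llava"):
    """Build an Ollama chat request asking a vision-capable model
    to transcribe a page image. Images are passed base64-encoded
    in the message's `images` field."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "Transcribe all text, tables, and formulas on this page.",
            "images": [image_b64],
        }],
    }

# Sending the request requires a running Ollama server:
#   import ollama
#   reply = ollama.chat(**build_vision_request("scan_p3.png"))
#   page_text = reply["message"]["content"]
```

Keeping request construction separate from the network call makes the visual path easy to test without a model loaded.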
Tracks quality scores, domains, difficulty, split counts, question length, answer length, and record distributions so you can inspect dataset health after every run.
Generates explicit step-by-step reasoning fields alongside final answers, making the dataset suitable for reasoning-model training and analysis workflows.
Creates chosen and rejected answers by injecting controlled failure modes such as hallucination, incompleteness, or wrong-formula behavior for alignment-style training.
Uses embedding similarity to connect related chunks, then asks the model to create questions that require synthesis across both pieces of source material.
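The chunk-pairing step can be sketched with plain cosine similarity over precomputed embeddings. The similarity band and function name are illustrative assumptions: too low and the chunks are unrelated, too high and they are near duplicates, and either extreme makes a weak multi-hop question.

```python
import numpy as np

def pair_related_chunks(embeddings: np.ndarray, min_sim=0.6, max_sim=0.95):
    """Pair each chunk with its most similar *other* chunk when
    the similarity falls in the useful [min_sim, max_sim) band."""
    # Normalize rows so the dot product is cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -1.0)  # never pair a chunk with itself
    pairs = []
    for i in range(len(sims)):
        j = int(np.argmax(sims[i]))
        if min_sim <= sims[i, j] < max_sim and (j, i) not in pairs:
            pairs.append((i, j))
    return pairs

embs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
pairs = pair_related_chunks(embs)  # chunk 1 relates to both others
```

Each resulting `(i, j)` pair is then fed to the model with both source chunks so the generated question genuinely requires synthesis across them.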
One run produces dataset files for training plus reporting files for quality inspection and automation pipelines.
Conversation-style records compatible with Unsloth and common TRL workflows.
Instruction-input-output records for broad community fine-tuning tooling.
Reasoning and answer fields separated for reasoning-model training.
Chosen and rejected response pairs with rejection labels for alignment training.
Full metadata-rich schema with domains, difficulty, timestamps, and provenance fields.
Streaming-friendly newline-delimited form of the same rich training schema.
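One QA pair fanning out into these export shapes can be sketched as follows. The field names follow common chat-messages, Alpaca, and DPO conventions and are representative, not necessarily this project's exact schema.

```python
import json

def to_records(question, answer, reasoning, rejected, meta):
    """Render one generated QA pair into the export shapes listed
    above. Field names follow widespread community conventions."""
    return {
        # conversation-style, compatible with Unsloth / TRL chat datasets
        "chat": {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]},
        # Alpaca-style instruction tuning record
        "alpaca": {"instruction": question, "input": "", "output": answer},
        # separated reasoning and answer for reasoning-model training
        "reasoning": {"question": question, "reasoning": reasoning,
                      "answer": answer},
        # preference pair for alignment-style (DPO) training
        "dpo": {"prompt": question, "chosen": answer, "rejected": rejected},
        # metadata-rich schema, also emitted as newline-delimited JSON
        "rich": {"question": question, "answer": answer, **meta},
    }

records = to_records(
    "What does the chunker preserve?",
    "Formulas and table structure.",
    "The chunking step explicitly protects LaTeX blocks and table rows.",
    "It removes all formatting.",  # injected-failure rejected answer
    {"domain": "documentation", "difficulty": "easy"},
)
jsonl_line = json.dumps(records["rich"])  # one line of the JSONL export
```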
Visual inspection report for score histograms, distribution summaries, and dataset stats.
Machine-readable dataset diagnostics for CI, automation, and audit workflows.
Every question-answer pair is evaluated before it enters the final dataset so low-signal or broken examples do not pollute training.
Rewards answers that stay anchored to the source chunk instead of inventing unsupported claims.
Filters out trivial one-line outputs and oversized, low-density answers that reduce training quality.
Rejects uncertain or refusal-style outputs that indicate the model failed to answer from the source.
Catches mid-sentence cuts, dangling endings, and partial generations caused by token or parser limits.
Ensures the prompt is actually phrased like a usable question rather than a malformed statement.
Penalizes copy-paste style outputs where the model repeats the prompt instead of answering it.
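Cheap heuristic versions of the checks above can be sketched in one function. The thresholds and regexes are illustrative assumptions; real scoring can be stricter or model-assisted.

```python
import re

def quality_flags(question: str, answer: str, source: str) -> dict:
    """Heuristic pass/fail flags for one QA pair. Thresholds are
    illustrative, not the project's actual scoring weights."""
    ans_words = set(re.findall(r"\w+", answer.lower()))
    src_words = set(re.findall(r"\w+", source.lower()))
    overlap = len(ans_words & src_words) / max(1, len(ans_words))
    return {
        # groundedness: most answer vocabulary should come from the source
        "grounded": overlap >= 0.5,
        # structure: not trivially short, not a low-density wall of text
        "structured": 5 <= len(answer.split()) <= 300,
        # refusal / uncertainty markers
        "confident": not re.search(
            r"\b(i don't know|cannot answer|as an ai)\b", answer.lower()),
        # truncation: answers should end at a sentence boundary
        "complete": answer.rstrip().endswith((".", "!", "?", "`", ")")),
        # the prompt should actually be phrased as a question
        "is_question": question.rstrip().endswith("?"),
        # answer-fit: the answer should not just echo the prompt
        "not_echo": answer.strip().lower() != question.strip().lower(),
    }
```

A pair only enters the final dataset when every flag passes, so one failed check is enough to reject a record.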
Each format has its own extraction path so the pipeline can preserve more structure before generation begins.
Text extraction, scanned-page handling, and visual content support for mixed technical documents.
Structured reading of paragraphs, tables, and inline document content for report-style sources.
Sheet-wise conversion of tabular data into readable structured text for downstream generation.
Slide text, notes, and image-aware handling to support deck-style educational and business content.
Clean tag stripping and readable text extraction for webpages, docs, and generated documentation.
Lightweight direct ingestion with logical block partitioning for notes, logs, and raw corpora.
Header-aware row formatting that turns structured records into useful textual training entries.
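Header-aware row formatting can be sketched with the standard-library CSV reader; the `Header: value` layout and helper name are assumptions for the example.

```python
import csv
import io

def rows_to_text(csv_text: str) -> list:
    """Turn each CSV data row into a 'header: value' text block so
    downstream generation sees labeled fields, not bare cells.
    Empty cells are dropped rather than emitted as blank fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{k}: {v}" for k, v in row.items() if v)
            for row in reader]

sample = "name,role,location\nAda,Engineer,London\nLin,Analyst,\n"
blocks = rows_to_text(sample)
# blocks[0] == "name: Ada; role: Engineer; location: London"
```

Pairing each value with its column header keeps the semantics of a row intact even after the tabular layout is flattened into plain text.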
These values reflect the pipeline's practical operating behavior.
The page keeps the AIRPD visual style, while the tech stack reflects the real pipeline architecture and its local-first workflow.
Local model orchestration for QA generation, categorization, and enrichment stages.
Semantic embeddings for chunk relationships, retrieval-style matching, and dedup support.
Similarity engine for duplicate filtering and cross-chunk multi-hop pairing.
Lightweight embedding fallback for chunking and semantic segmentation workflows.
Robust PDF text extraction and fallback handling for technical and mixed-layout documents.
Format-specific extraction for Office files while preserving document structure.
Visual conversion and image processing for scanned pages and slide-based assets.
Readable text extraction for web content without leaving heavy markup in the pipeline.
Precise token accounting and safer output shaping before generation and export.
Vector math, similarity logic, and general analytics support across the pipeline.
Retry safety for unstable model responses and long-running generation jobs.
Visual quality reporting for distribution plots, score histograms, and summary exports.
What makes this project special is that it is not just another AI wrapper: it is built to run locally, keeps source material on-device, and remains usable even in constrained environments.