The unglamorous bug in every fine-tuning tutorial: nobody cleans the data

Fine-tuning posts on this forum mostly focus on models, LoRA configs, and trainers. The input layer — the dataset — gets far less attention, even though most failed fine-tunes we see in practice come down to dirty data.

At ModelBrew we spend most of our time on this problem. So we ran a small reproducible experiment to put numbers on it.

Setup

We took four well-known instruction-tuning datasets from the Hub (searchable by name): medalpaca/medical_meadow_medqa, b-mc2/sql-create-context, openai/gsm8k, and gbharti/finance-alpaca.

For each, we converted rows to {instruction, output} JSONL, trimmed to roughly 2 MB, and injected a known, fixed amount of noise on top of the clean baseline:

Noise category                                   | Injected fraction
Exact duplicate rows                             | ~1.0%
Empty output                                     | ~1.0%
One-word outputs ("ok", ".", "yes")              | ~1.0%
HTML-wrapped outputs (<p>…</p>, &amp;)           | ~0.5%
Leading/trailing whitespace + zero-width unicode | ~0.5%

The ground-truth counts for each file are known exactly, down to the row. We then scanned each poisoned file with the ModelBrew Dataset Optimizer.
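For readers who want to reproduce this kind of poisoning on their own data, here is a minimal sketch of an injection step like the one described above. It assumes rows are already {instruction, output} dicts. The function name, the choice to append duplicates while corrupting the other categories in place, the exact corrupted strings, and the manifest keys are all illustrative assumptions, not the script we used.

```python
import html
import random

ZERO_WIDTH_SPACE = "\u200b"


def inject_noise(rows, seed=0):
    """Return (noisy_rows, manifest) with a fixed fraction of each noise type.

    Hypothetical helper: fractions follow the table above (~1.0% and ~0.5%),
    everything else is an assumption made for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the injection reproducible
    n = len(rows)
    n_one = n // 100   # ~1.0% of rows
    n_half = n // 200  # ~0.5% of rows

    noisy = [dict(r) for r in rows]  # copy so the clean baseline is untouched

    # Pick disjoint row indices so each row gets at most one corruption.
    picked = rng.sample(range(n), 2 * n_one + 2 * n_half)
    empty_idx = picked[:n_one]
    short_idx = picked[n_one:2 * n_one]
    html_idx = picked[2 * n_one:2 * n_one + n_half]
    ws_idx = picked[2 * n_one + n_half:]

    for i in empty_idx:
        noisy[i]["output"] = ""
    for i in short_idx:
        noisy[i]["output"] = rng.choice(["ok", ".", "yes"])
    for i in html_idx:
        # Escape entities (& becomes &amp;) and wrap in a paragraph tag.
        noisy[i]["output"] = "<p>" + html.escape(noisy[i]["output"]) + "</p>"
    for i in ws_idx:
        noisy[i]["output"] = "  " + ZERO_WIDTH_SPACE + noisy[i]["output"] + " \n"

    # Exact duplicates: append copies of rows that were left clean.
    clean_idx = sorted(set(range(n)) - set(picked))
    for i in rng.sample(clean_idx, n_one):
        noisy.append(dict(rows[i]))

    manifest = {
        "duplicate": n_one,
        "empty": n_one,
        "too-short": n_one,
        "html": n_half,
        "whitespace": n_half,
    }
    return noisy, manifest
```

Keeping the corrupted indices disjoint means every injected row belongs to exactly one category, which makes the ground truth unambiguous when you score a detector against it.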

What we injected, per file

file                 | rows | duplicate | empty | too-short | html | whitespace
finance-alpaca       | 1848 | 18        | 18    | 18        | 9    | 9
medical_meadow_medqa | 2033 | 20        | 20    | 20        | 10   | 10
gsm8k                | 3807 | 37        | 37    | 37        | 18   | 18
sql-create-context   | 7238 | 71        | 71    | 71        | 35   | 35

The Optimizer caught the injected noise across all four files. One surprise worth flagging: even before we added any noise, the raw finance-alpaca source had real PII rows and “As an AI language model…” slop leaking through. The scanner caught those too. Public HF datasets are not as clean as people assume.

Why this matters for anyone fine-tuning here

Common data pathologies we see in customer datasets that don’t fail CI but quietly degrade the trained model:

  • 30–50% near-duplicates from upstream scraping → wasted compute and implicit upweighting of a few examples.
  • PII leaking through from customer support logs → memorization and extractable-at-inference risk.
  • HTML/markdown from web scrapes → models learn to emit markup.
  • A few percent of rows containing “As an AI language model…” slop → unintended persona injection.
  • Empty or one-word completions teaching the model to output nothing.

None of this is visible from eyeballing the first 20 rows.
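Several of the pathologies above can be flagged with very simple per-row checks. The sketch below is a set of rough heuristics, not how the Optimizer works internally: the function name, regexes, and flag labels are assumptions chosen for illustration, and it deliberately skips near-duplicate and PII detection, which need more than a regex.

```python
import re

# Common zero-width / invisible code points.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
# Something that looks like an HTML tag or an HTML entity.
HTML_LIKE = re.compile(r"</?[a-zA-Z][^>]*>|&[a-zA-Z]+;|&#\d+;")
AI_SLOP = re.compile(r"\bas an ai language model\b", re.IGNORECASE)


def flag_row(row, seen):
    """Return a list of issue labels for one {instruction, output} row.

    `seen` is a set shared across calls; it holds (instruction, output)
    pairs already encountered so exact duplicates can be detected.
    """
    instruction = row.get("instruction", "")
    output = row.get("output", "")
    flags = []

    key = (instruction, output)
    if key in seen:
        flags.append("duplicate")
    seen.add(key)

    if output.strip() == "":
        flags.append("empty")
    elif len(output.split()) <= 1:
        flags.append("too_short")

    if HTML_LIKE.search(output):
        flags.append("html")

    # Leading/trailing whitespace or invisible characters anywhere.
    if output != output.strip() or ZERO_WIDTH.search(output):
        flags.append("whitespace")

    if AI_SLOP.search(output):
        flags.append("ai_slop")

    return flags
```

Checks like these produce false positives: a one-word output can be a legitimate answer in a math or multiple-choice dataset, and the HTML pattern can fire on SQL comparisons such as `a <b AND c> 1`. Treat the flags as candidates for review rather than an automatic delete list.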

Artifacts and where to try it

We’ve published the 4 poisoned test files plus a JSON manifest of the exact injection counts as a Hub dataset so others can benchmark their own data-quality tools against the same ground truth: modelbrew/optimizer-noise-benchmark.
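If you want to benchmark your own tool against the injected counts, a count-level comparison is the simplest starting point. The sketch below hardcodes the finance-alpaca row from the table above; the dict shape and the function name are assumptions for illustration and may not match the layout of the published manifest.

```python
# Injected counts for finance-alpaca, copied from the table above.
FINANCE_ALPACA_INJECTED = {
    "duplicate": 18,
    "empty": 18,
    "too-short": 18,
    "html": 9,
    "whitespace": 9,
}


def compare_counts(injected, detected):
    """Compare per-category detected counts against injected counts.

    Both arguments map category name -> number of rows. A category missing
    from `detected` is treated as zero detections.
    """
    report = {}
    for category, n_injected in injected.items():
        n_detected = detected.get(category, 0)
        report[category] = {
            "injected": n_injected,
            "detected": n_detected,
            # Lower bound on how many injected rows were missed.
            "missed_at_least": max(n_injected - n_detected, 0),
            # Flags beyond the injected count: either issues already present
            # in the source data or false positives.
            "extra": max(n_detected - n_injected, 0),
        }
    return report
```

Counts alone cannot tell you whether a tool flagged the right rows, only whether the totals are plausible. A surplus in a category is not necessarily an error, since the source data can contain real issues of its own.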

The scanner itself — score, per-issue breakdown, one-click autofix, and export — is a free tool at app.modelbrew.ai. No signup needed to scan a file.

Happy to discuss methodology, add more datasets to the benchmark, or hear what noise categories we should inject next. If you’ve shipped a fine-tune that went sideways because of data, what was the root cause?

Great user experience overall. Makes working with data optimization much more accessible.

This pipeline is what 2026 needs.