Fine-tuning posts on this forum mostly focus on models, LoRA configs, and trainers. The input layer — the dataset — gets far less attention, even though most failed fine-tunes we see in practice come down to dirty data.
At ModelBrew we spend most of our time on this problem. So we ran a small reproducible experiment to put numbers on it.
Setup
We took four well-known instruction-tuning datasets from the Hub (searchable by name): medalpaca/medical_meadow_medqa, b-mc2/sql-create-context, openai/gsm8k, and gbharti/finance-alpaca.
For each, we converted rows to {instruction, output} JSONL, trimmed to roughly 2 MB, and injected a known, fixed amount of noise on top of the clean baseline:
| Noise category | Injected fraction |
|---|---|
| Exact duplicate rows | ~1.0% |
| Empty output | ~1.0% |
| One-word outputs ("ok", ".", "yes") | ~1.0% |
| HTML-wrapped outputs (`<p>…</p>`, `&amp;`) | ~0.5% |
| Leading/trailing whitespace + zero-width unicode | ~0.5% |
Because we injected the noise ourselves, the ground-truth counts for each file are known down to the row. We then scanned each poisoned file with the ModelBrew Dataset Optimizer.
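For anyone who wants to build a similar test file, here is a minimal sketch of what an injection step like this could look like. The function name `inject_noise`, the seed, the manifest layout, and the choice to append duplicates while mutating the other rows in place are all assumptions for illustration, not the exact script behind the benchmark.

```python
import html
import random

ZERO_WIDTH_SPACE = "\u200b"

def inject_noise(rows, seed=0, dup_frac=0.01, empty_frac=0.01,
                 short_frac=0.01, html_frac=0.005, ws_frac=0.005):
    """Return (noisy_rows, manifest) for a list of {instruction, output} dicts.

    Hypothetical helper: the manifest maps each noise category to the
    indices of the affected rows in the returned list.
    """
    rng = random.Random(seed)          # fixed seed keeps the run reproducible
    noisy = [dict(r) for r in rows]    # copy so the clean baseline is untouched
    n = len(noisy)

    counts = {
        "empty": int(n * empty_frac),
        "too_short": int(n * short_frac),
        "html": int(n * html_frac),
        "whitespace": int(n * ws_frac),
    }

    # Shuffle once and slice, so every row gets at most one kind of noise.
    order = list(range(n))
    rng.shuffle(order)
    manifest, start = {}, 0
    for category, k in counts.items():
        manifest[category] = sorted(order[start:start + k])
        start += k

    for i in manifest["empty"]:
        noisy[i]["output"] = ""
    for i in manifest["too_short"]:
        noisy[i]["output"] = rng.choice(["ok", ".", "yes"])
    for i in manifest["html"]:
        # Escape entities (& becomes &amp;) and wrap in a paragraph tag.
        noisy[i]["output"] = "<p>" + html.escape(noisy[i]["output"]) + "</p>"
    for i in manifest["whitespace"]:
        noisy[i]["output"] = "  " + noisy[i]["output"] + ZERO_WIDTH_SPACE + " \n"

    # Exact duplicates: append copies of rows that were not mutated above.
    k_dup = int(n * dup_frac)
    manifest["duplicate"] = []
    for i in order[start:start + k_dup]:
        noisy.append(dict(noisy[i]))
        manifest["duplicate"].append(len(noisy) - 1)

    return noisy, manifest
```

Keeping the manifest at row-index level rather than only as counts makes it possible to check which specific rows a tool missed, not just how many.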
What we injected, per file
| file | rows | duplicate | empty | too-short | html | whitespace |
|---|---|---|---|---|---|---|
| finance-alpaca | 1848 | 18 | 18 | 18 | 9 | 9 |
| medical_meadow_medqa | 2033 | 20 | 20 | 20 | 10 | 10 |
| gsm8k | 3807 | 37 | 37 | 37 | 18 | 18 |
| sql-create-context | 7238 | 71 | 71 | 71 | 35 | 35 |
The Optimizer caught the injected noise across all four files. One surprise worth flagging: even before we added any noise, the raw finance-alpaca source had real PII rows and “As an AI language model…” slop leaking through. The scanner caught those too. Public HF datasets are not as clean as people assume.
Why this matters for anyone fine-tuning here
Common data pathologies we see in customer datasets that don’t fail CI but quietly degrade the trained model:
- 30–50% near-duplicates from upstream scraping → wasted compute and implicit upweighting of a few examples.
- PII leaking through from customer support logs → memorization and extractable-at-inference risk.
- HTML/markdown from web scrapes → models learn to emit markup.
- A few percent of rows containing “As an AI language model…” slop → unintended persona injection.
- Empty or one-word completions → the model learns to output nothing.
None of this is visible from eyeballing the first 20 rows.
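To make the categories above concrete, here is a rough sketch of the kind of per-row checks that surface them. It is a simplified illustration, not how the Optimizer works internally: the `scan` function, the regexes, and the `min_words` threshold are all assumptions. It only covers exact duplicates, empty and very short outputs, HTML, stray whitespace, and the "As an AI language model" phrase. Near-duplicate and PII detection need heavier machinery and are left out.

```python
import re
from collections import Counter

HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
HTML_ENTITY = re.compile(r"&(?:[a-zA-Z]+|#\d+);")
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
AI_SLOP = re.compile(r"\bas an ai language model\b", re.IGNORECASE)

def scan(rows, min_words=2):
    """Count simple data-quality issues in {instruction, output} rows."""
    issues = Counter()
    seen = set()
    for row in rows:
        instruction = row.get("instruction", "")
        output = row.get("output", "")

        # Exact duplicate: the same (instruction, output) pair seen before.
        key = (instruction, output)
        if key in seen:
            issues["duplicate"] += 1
        seen.add(key)

        # Empty or too short, judged after removing invisible characters.
        visible = ZERO_WIDTH.sub("", output).strip()
        if not visible:
            issues["empty"] += 1
        elif len(visible.split()) < min_words:
            issues["too_short"] += 1

        if HTML_TAG.search(output) or HTML_ENTITY.search(output):
            issues["html"] += 1
        if output != output.strip() or ZERO_WIDTH.search(output):
            issues["whitespace"] += 1
        if AI_SLOP.search(output):
            issues["ai_slop"] += 1
    return issues
```

Checks this simple produce false positives: a legitimate one-token answer counts as too short, and text such as `a<b and c>d` can look like a tag to the regex. Treat the counts as a prompt to inspect rows, not as a verdict.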
Artifacts and where to try it
We’ve published the four poisoned test files plus a JSON manifest of the exact injection counts as a Hub dataset so others can benchmark their own data-quality tools against the same ground truth: modelbrew/optimizer-noise-benchmark.
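If you want to score your own tool against the injected counts, a comparison along these lines is one option. The dictionary layout and the `compare_counts` helper are assumptions for illustration, and the published manifest may be structured differently. The injected numbers are the finance-alpaca row from the table above. The detected numbers are placeholders to show the output shape, not results from any tool.

```python
def compare_counts(injected, detected):
    """Compare a tool's per-category counts with the injected ground truth.

    Both arguments are hypothetical {category: count} dicts.
    """
    report = {}
    for category, n_injected in injected.items():
        n_detected = detected.get(category, 0)
        report[category] = {
            "injected": n_injected,
            "detected": n_detected,
            # Negative delta: the tool missed rows. Positive delta: it flagged
            # more than was injected, which can mean false positives or issues
            # already present in the raw source.
            "delta": n_detected - n_injected,
        }
    return report

# Injected counts for finance-alpaca, taken from the table above.
injected = {"duplicate": 18, "empty": 18, "too-short": 18, "html": 9, "whitespace": 9}
# Placeholder output from some tool under test (made-up numbers).
detected = {"duplicate": 18, "empty": 18, "too-short": 21, "html": 7}

for category, row in compare_counts(injected, detected).items():
    print(category, row)
```

A positive delta is not automatically an error on the tool's side, since the raw sources already contained some problem rows before any noise was added.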
The scanner itself — score, per-issue breakdown, one-click autofix, and export — is a free tool at app.modelbrew.ai. No signup needed to scan a file.
Happy to discuss methodology, add more datasets to the benchmark, or hear what noise categories we should inject next. If you’ve shipped a fine-tune that went sideways because of data, what was the root cause?
