While working on a competition recently, I noticed something interesting: my model would overfit really quickly. With only ~2k rows, the dataset clearly wasn't big enough. I wanted to try standard augmentation techniques, but I also felt that LLMs could be the best way to improve things… though most of the capable ones sit behind APIs that require keys, which makes experimenting harder.
That got me thinking: why don’t we have a dedicated model built for text augmentation yet? We have so many types of models, but no one has really made a “super” augmentation model that generates high-quality data for downstream tasks.
Here’s the approach I’m imagining—turning a language model into a self-teaching augmentation engine:
- Start small, think big – Begin with a lightweight LM, like Qwen3-0.6B, so it’s fast and easy to experiment with.
- Generate new ideas – Give it prompts to create augmented versions of your text, producing more data than your original tiny dataset.
- Keep only the good stuff – Use a strong multi-class classifier to check each new example. If it preserves the original label, keep it; if not, discard it.
- Learn from success – Fine-tune your LM on the filtered examples, so it improves its augmentation skills over time.
- Repeat and grow – Run the loop again with fresh data, gradually building a self-improving, super-augmentation model that keeps getting smarter and generates high-quality data for any downstream task. (A minimal sketch of the loop follows below.)
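To make the loop concrete, here's a rough sketch of how one round could look, assuming Qwen3-0.6B is loadable through Hugging Face transformers. The prompt template, the generation and training hyperparameters, and the `label_preserving` hook (sketched a bit further down) are all placeholders of mine, not a tested recipe:

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "Qwen/Qwen3-0.6B"  # step 1: a lightweight LM that's cheap to iterate on
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:  # causal LMs often ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder prompt template; the real one would need tuning per task.
PROMPT = "Paraphrase the following text, keeping its meaning intact:\n{text}\nParaphrase:"

def augment(text: str, n: int = 4) -> list[str]:
    """Step 2: sample n augmented versions of `text` from the LM."""
    inputs = tokenizer(PROMPT.format(text=text), return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,  # sampling gives diverse candidates
            temperature=0.9,
            num_return_sequences=n,
            max_new_tokens=128,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
        for seq in out
    ]

def fine_tune(pairs: list[tuple[str, str]]) -> None:
    """Step 4: fine-tune the LM on the (source, kept augmentation) pairs."""
    texts = [PROMPT.format(text=src) + " " + aug for src, aug in pairs]
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="aug-lm", num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

def self_improve(rows: list[str], label_preserving, rounds: int = 3) -> None:
    """Step 5: repeat generate -> filter -> fine-tune on each fresh batch."""
    for _ in range(rounds):
        kept = [
            (text, cand)
            for text in rows
            for cand in augment(text)
            if label_preserving(text, cand)  # step 3: keep only the good stuff
        ]
        if kept:
            fine_tune(kept)
```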
The main challenge is filtering correctly. My current thought is that a classifier covering 100+ classes could act as the gatekeeper: if an augmented example keeps its original label it survives, otherwise it gets discarded.
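As a placeholder for that gatekeeper, the filter could look something like this. The model below is a small two-class sentiment classifier used purely as a stand-in, not the broad 100+ class model I have in mind:

```python
# Hedged sketch of the label-preservation filter. The classifier is a 2-class
# stand-in; the actual idea calls for a much broader multi-class model.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

def label_preserving(original: str, candidate: str) -> bool:
    """Keep a candidate only if it gets the same predicted label as its source."""
    return clf(candidate)[0]["label"] == clf(original)[0]["label"]
```

This drops straight into the `self_improve` loop above as its `label_preserving` argument.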
I haven’t started working on this yet, but I’m really curious to hear your thoughts: could something like this make augmentation easier and more effective, or are classic techniques already doing the job well enough? Any feedback, ideas, or experiences would be amazing!