While working on a competition recently, I noticed something interesting: my model would overfit really quickly. With only ~2k rows, the dataset clearly wasn't big enough. I wanted to try standard augmentation techniques, but I also felt that LLMs could be the best way to improve things… though most of the capable ones sit behind APIs that require keys, which makes experimenting harder.
That got me thinking: why don’t we have a dedicated model built for text augmentation yet? We have so many types of models, but no one has really made a “super” augmentation model that generates high-quality data for downstream tasks.
Here’s the approach I’m imagining—turning a language model into a self-teaching augmentation engine:
- Start small, think big – Begin with a lightweight LM, like Qwen3-0.6B, so it’s fast and easy to experiment with.
- Generate new ideas – Give it prompts to create augmented versions of your text, producing more data than your original tiny dataset.
- Keep only the good stuff – Use a strong multi-class classifier to check each new example. If it preserves the original label, keep it; if not, discard it.
- Learn from success – Fine-tune your LM on the filtered examples, so it improves its augmentation skills over time.
- Repeat and grow – Run the loop again with fresh data, gradually building a self-improving, super-augmentation model that keeps getting smarter and generates high-quality data for any downstream task. (A minimal sketch of the loop follows below.)
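To make the loop concrete, here's a rough sketch of how one round could look, assuming Qwen3-0.6B is loadable through Hugging Face transformers. The prompt template, the generation and training hyperparameters, and the `label_preserving` hook (sketched a bit further down) are all placeholders of mine, not a tested recipe:

```python
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "Qwen/Qwen3-0.6B"  # step 1: a lightweight LM that's cheap to iterate on
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:  # causal LMs often ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder prompt template; the real one would need tuning per task.
PROMPT = "Paraphrase the following text, keeping its meaning intact:\n{text}\nParaphrase:"

def augment(text: str, n: int = 4) -> list[str]:
    """Step 2: sample n augmented versions of `text` from the LM."""
    inputs = tokenizer(PROMPT.format(text=text), return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,  # sampling gives diverse candidates
            temperature=0.9,
            num_return_sequences=n,
            max_new_tokens=128,
        )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
        for seq in out
    ]

def fine_tune(pairs: list[tuple[str, str]]) -> None:
    """Step 4: fine-tune the LM on the (source, kept augmentation) pairs."""
    texts = [PROMPT.format(text=src) + " " + aug for src, aug in pairs]
    ds = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )
    Trainer(
        model=model,
        args=TrainingArguments(output_dir="aug-lm", num_train_epochs=1,
                               per_device_train_batch_size=4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

def self_improve(rows: list[str], label_preserving, rounds: int = 3) -> None:
    """Step 5: repeat generate -> filter -> fine-tune on each fresh batch."""
    for _ in range(rounds):
        kept = [
            (text, cand)
            for text in rows
            for cand in augment(text)
            if label_preserving(text, cand)  # step 3: keep only the good stuff
        ]
        if kept:
            fine_tune(kept)
```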
The main challenge is filtering correctly. My current thought is that a classifier covering 100+ classes could act as the gatekeeper: if an augmented example keeps its original label it survives, otherwise it gets discarded.
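As a placeholder for that gatekeeper, the filter could look something like this. The model below is a small two-class sentiment classifier used purely as a stand-in, not the broad 100+ class model I have in mind:

```python
# Hedged sketch of the label-preservation filter. The classifier is a 2-class
# stand-in; the actual idea calls for a much broader multi-class model.
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

def label_preserving(original: str, candidate: str) -> bool:
    """Keep a candidate only if it gets the same predicted label as its source."""
    return clf(candidate)[0]["label"] == clf(original)[0]["label"]
```

This drops straight into the `self_improve` loop above as its `label_preserving` argument.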
I haven’t started working on this yet, but I’m really curious to hear your thoughts: could something like this make augmentation easier and more effective, or are classic techniques already doing the job well enough? Any feedback, ideas, or experiences would be amazing!