Data Augmentation using LLMs

Dear HuggingFace Community,

I am interested in exploring how large language models (LLMs) handle session-based and sequential data understanding in the context of SQL queries. Most existing datasets only provide a single query paired with its label, but I would like to investigate whether we can augment public datasets to better capture sequential interactions.

For example, consider the SQLShield dataset, which follows the structure below:

| question | query | context | malicious |
| --- | --- | --- | --- |
| What is the description of the claim status "Open"? | SELECT claim_status_description FROM claims_processing_stages WHERE claim_status_name = 'Open'; | CREATE TABLE claims_processing_stages ( claim_status_description VARCHAR, claim_status_name VARCHAR ); | 0 |
| SELECT AVG(Snatch) FROM body_builder | SELECT AVG(Snatch) FROM body_builder; | CREATE TABLE body_builder ( Snatch INTEGER ); | 0 |

My idea is to provide an LLM with some of these records at random and, through carefully designed prompts, ask it to generate a sequence of related queries that could represent a session. However, I could not find any similar research, blog posts, or other resources on this, so if you have come across something along these lines, I would be grateful if you could share it. A rough sketch of what I have in mind is shown below.
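For illustration, here is a minimal sketch of the prompting loop I am imagining, assuming the records are already loaded as dicts in the SQLShield format and using the `transformers` text-generation pipeline. The model choice, the prompt wording, and the `build_session_prompt` helper are only placeholders, not a tested setup:

```python
import random
import textwrap

from transformers import pipeline

# Placeholder records in the SQLShield format (question, query, context, malicious);
# in practice these would be sampled from the real dataset.
records = [
    {
        "question": "What is the description of the claim status 'Open'?",
        "query": (
            "SELECT claim_status_description FROM claims_processing_stages "
            "WHERE claim_status_name = 'Open';"
        ),
        "context": (
            "CREATE TABLE claims_processing_stages ( "
            "claim_status_description VARCHAR, claim_status_name VARCHAR );"
        ),
        "malicious": 0,
    },
]

# Model choice is just an example; any instruction-tuned model could be used.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")


def build_session_prompt(seed: dict, session_length: int = 5) -> str:
    """Turn one seed record into a prompt asking for a plausible follow-up session."""
    return textwrap.dedent(f"""\
        You are simulating a database user's session.
        Schema: {seed['context']}
        The user starts with this question and query:
        Question: {seed['question']}
        Query: {seed['query']}
        Generate {session_length} follow-up SQL queries the same user might run
        next in this session, one per line, all valid against the schema above.
        """)


seed = random.choice(records)
result = generator(build_session_prompt(seed), max_new_tokens=300, do_sample=True)
print(result[0]["generated_text"])
```

The generated follow-up queries, together with the seed record, would then form one synthetic "session" example.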

I’d love to hear your thoughts:

  • Do you see this as a valuable contribution?
  • Do you have other suggestions or approaches for enriching datasets in this way?

Do you see this as a valuable contribution?

Yeah. Seems promising.

We used a similar approach in a different topic area (LLMs4Subjects competition):

3.2 Synthetic training data

We found that the number of training records provided was quite small compared to the size of the subject vocabulary, so we used the Llama-3.1-8B-Instruct LLM to generate additional synthetic training records. We presented the LLM with each of the existing training records (its title and abstract), one at a time, along with its manually assigned subject labels (GND preferred terms in either German or English, matching the language of the document). We then asked the LLM to generate a similar record with the same set of subjects plus one additional, randomly chosen preferred term from the GND (step 2 in Figure 1; see also the example in Figure 4 in Appendix D). This additional subject caused the LLM to generate a novel record, not just to rephrase the given example, and also helped to expand the subject coverage of the training data set to new GND subjects.

See the report: Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs
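Roughly, the per-record augmentation step looked like the sketch below (a simplified illustration, not our actual code; the field names, the `gnd_terms` list, and the prompt wording are placeholders):

```python
import random

# Simplified sketch of the augmentation step described above (not the actual
# Annif code). Assumes each record is a dict with "title", "abstract" and
# "subjects" keys, and gnd_terms is a list of GND preferred terms in the
# matching language.


def build_augmentation_prompt(record: dict, gnd_terms: list[str]) -> tuple[str, list[str]]:
    # Add one randomly chosen term the record does not already carry, so the
    # model writes a genuinely new record instead of paraphrasing the original.
    extra = random.choice([t for t in gnd_terms if t not in record["subjects"]])
    target_subjects = record["subjects"] + [extra]
    prompt = (
        "Here is a bibliographic record and its subject terms.\n"
        f"Title: {record['title']}\n"
        f"Abstract: {record['abstract']}\n"
        f"Subjects: {', '.join(record['subjects'])}\n\n"
        "Write a title and abstract for a different document that would be "
        f"indexed with exactly these subjects: {', '.join(target_subjects)}."
    )
    # The LLM's output is then stored as a synthetic training record
    # labelled with target_subjects.
    return prompt, target_subjects
```

The same pattern should transfer to your SQL case: take an existing record, perturb or extend its label side, and ask the model to produce the matching new input.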
