Data Augmentation using LLMs

Dear HuggingFace Community,

I am interested in exploring how large language models (LLMs) handle session-based and sequential data understanding in the context of SQL queries. Most existing datasets only provide a single query paired with its label, but I would like to investigate whether we can augment public datasets to better capture sequential interactions.

For example, consider the SQLShield dataset, which follows the structure below:

| question | query | context | malicious |
| --- | --- | --- | --- |
| What is the description of the claim status "Open"? | SELECT claim_status_description FROM claims_processing_stages WHERE claim_status_name = 'Open'; | CREATE TABLE claims_processing_stages ( claim_status_description VARCHAR, claim_status_name VARCHAR ); | 0 |
| SELECT AVG(Snatch) FROM body_builder | SELECT AVG(Snatch) FROM body_builder; | CREATE TABLE body_builder ( Snatch INTEGER ); | 0 |

My idea is to provide an LLM with some of these records at random and, through carefully designed prompts, ask it to generate a sequence of related queries that could represent a session. However, I could not find any similar research, blog posts, or other resources on this, so if you have come across something along these lines, I would be grateful if you could share it. A rough sketch of what I have in mind is shown below.
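For illustration, here is a minimal sketch of the prompting loop I am imagining, assuming the records are already loaded as dicts in the SQLShield format and using the `transformers` text-generation pipeline. The model choice, the prompt wording, and the `build_session_prompt` helper are only placeholders, not a tested setup:

```python
import random
import textwrap

from transformers import pipeline

# Placeholder records in the SQLShield format (question, query, context, malicious);
# in practice these would be sampled from the real dataset.
records = [
    {
        "question": "What is the description of the claim status 'Open'?",
        "query": (
            "SELECT claim_status_description FROM claims_processing_stages "
            "WHERE claim_status_name = 'Open';"
        ),
        "context": (
            "CREATE TABLE claims_processing_stages ( "
            "claim_status_description VARCHAR, claim_status_name VARCHAR );"
        ),
        "malicious": 0,
    },
]

# Model choice is just an example; any instruction-tuned model could be used.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")


def build_session_prompt(seed: dict, session_length: int = 5) -> str:
    """Turn one seed record into a prompt asking for a plausible follow-up session."""
    return textwrap.dedent(f"""\
        You are simulating a database user's session.
        Schema: {seed['context']}
        The user starts with this question and query:
        Question: {seed['question']}
        Query: {seed['query']}
        Generate {session_length} follow-up SQL queries the same user might run
        next in this session, one per line, all valid against the schema above.
        """)


seed = random.choice(records)
result = generator(build_session_prompt(seed), max_new_tokens=300, do_sample=True)
print(result[0]["generated_text"])
```

The generated follow-up queries, together with the seed record, would then form one synthetic "session" example.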

I’d love to hear your thoughts:

  • Do you see this as a valuable contribution?
  • Do you have other suggestions or approaches for enriching datasets in this way?

Do you see this as a valuable contribution?

Yeah. Seems promising.

We used a similar approach in a different topic area (LLMs4Subjects competition):

3.2 Synthetic training data

We found that the number of training records provided was quite small compared to the size of the subject vocabulary, so we used the Llama-3.1-8B-Instruct LLM to generate additional synthetic training records. We presented the LLM with each of the existing training records (its title and abstract), one at a time, along with its manually assigned subject labels (GND preferred terms in either German or English, matching the language of the document). We then asked the LLM to generate a similar record with the same set of subjects plus one additional, randomly chosen preferred term from the GND (step 2 in Figure 1; see also the example in Figure 4 in Appendix D). This additional subject caused the LLM to generate a novel record, not just to rephrase the given example, and also helped to expand the subject coverage of the training data set to new GND subjects.

See the report: Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs
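Roughly, the per-record augmentation step looked like the sketch below (a simplified illustration, not our actual code; the field names, the `gnd_terms` list, and the prompt wording are placeholders):

```python
import random

# Simplified sketch of the augmentation step described above (not the actual
# Annif code). Assumes each record is a dict with "title", "abstract" and
# "subjects" keys, and gnd_terms is a list of GND preferred terms in the
# matching language.


def build_augmentation_prompt(record: dict, gnd_terms: list[str]) -> tuple[str, list[str]]:
    # Add one randomly chosen term the record does not already carry, so the
    # model writes a genuinely new record instead of paraphrasing the original.
    extra = random.choice([t for t in gnd_terms if t not in record["subjects"]])
    target_subjects = record["subjects"] + [extra]
    prompt = (
        "Here is a bibliographic record and its subject terms.\n"
        f"Title: {record['title']}\n"
        f"Abstract: {record['abstract']}\n"
        f"Subjects: {', '.join(record['subjects'])}\n\n"
        "Write a title and abstract for a different document that would be "
        f"indexed with exactly these subjects: {', '.join(target_subjects)}."
    )
    # The LLM's output is then stored as a synthetic training record
    # labelled with target_subjects.
    return prompt, target_subjects
```

The same pattern should transfer to your SQL case: take an existing record, perturb or extend its label side, and ask the model to produce the matching new input.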
