Data Augmentation using LLMs

Dear HuggingFace Community,

I am interested in exploring how well large language models (LLMs) understand session-based, sequential data in the context of SQL queries. Most existing datasets provide only a single query paired with its label, but I would like to investigate whether we can augment public datasets to better capture sequential interactions.

For example, consider the SQLShield dataset, which follows the structure below:

| question | query | context | malicious |
|---|---|---|---|
| What is the description of the claim status “Open”? | SELECT claim_status_description FROM claims_processing_stages WHERE claim_status_name = 'Open'; | CREATE TABLE claims_processing_stages ( claim_status_description VARCHAR, claim_status_name VARCHAR ); | 0 |
| SELECT AVG(Snatch) FROM body_builder | SELECT AVG(Snatch) FROM body_builder; | CREATE TABLE body_builder ( Snatch INTEGER ); | 0 |

My idea is to give an LLM a random sample of these records and, through carefully designed prompts, ask it to generate a sequence of related queries that represents a session. However, I could not find any similar research, blog posts, or other write-ups. If you have seen any, I would be grateful if you could share them.
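To make the idea concrete, here is a rough sketch of the sampling and prompting steps in Python. It is only an illustration under my own assumptions: the helper names (`sample_seeds`, `build_session_prompt`, `parse_session`), the prompt wording, and the JSON reply format are all hypothetical and not part of SQLShield or any library. The two records are the rows shown above, and the LLM call itself is left out on purpose so that any model or API could be plugged in.

```python
import json
import random

# The two example rows shown above; a real run would load the full dataset instead.
RECORDS = [
    {
        "question": "What is the description of the claim status \"Open\"?",
        "query": "SELECT claim_status_description FROM claims_processing_stages "
                 "WHERE claim_status_name = 'Open';",
        "context": "CREATE TABLE claims_processing_stages ( "
                   "claim_status_description VARCHAR, claim_status_name VARCHAR );",
        "malicious": 0,
    },
    {
        "question": "SELECT AVG(Snatch) FROM body_builder",
        "query": "SELECT AVG(Snatch) FROM body_builder;",
        "context": "CREATE TABLE body_builder ( Snatch INTEGER );",
        "malicious": 0,
    },
]

# Hypothetical reply format: the keys every generated session step must carry.
REQUIRED_KEYS = {"question", "query", "malicious"}


def sample_seeds(records, k, seed=None):
    """Pick up to k seed records at random (seedable for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def build_session_prompt(seeds, session_length=5):
    """Build a prompt asking for one session of related queries (wording is an assumption)."""
    examples = "\n".join(json.dumps(r, ensure_ascii=False) for r in seeds)
    return (
        "Below are example records, one JSON object per line.\n"
        f"{examples}\n\n"
        "Choose ONE schema from the examples and write a session of "
        f"{session_length} related SQL queries that a single user might run "
        "in order, each building on the previous ones.\n"
        "Reply with a JSON list only. Each item must have the keys "
        "question, query, and malicious (0 or 1)."
    )


def parse_session(raw_reply):
    """Check that the model's reply is a non-empty JSON list of well-formed steps."""
    steps = json.loads(raw_reply)  # invalid JSON raises a ValueError subclass
    if not isinstance(steps, list) or not steps:
        raise ValueError("expected a non-empty JSON list")
    for step in steps:
        if not isinstance(step, dict):
            raise ValueError("each session step must be a JSON object")
        missing = REQUIRED_KEYS - set(step)
        if missing:
            raise ValueError(f"step is missing keys: {sorted(missing)}")
    return steps


if __name__ == "__main__":
    # Print the prompt that would be sent to whichever LLM is being tested.
    print(build_session_prompt(sample_seeds(RECORDS, k=2, seed=0)))
```

Validating the reply before keeping it seems important here, since a generated session is only useful as training or evaluation data if every step carries a query and a label in a consistent shape.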

I’d love to hear your thoughts:

  • Do you see this as a valuable contribution?
  • Do you have other suggestions or approaches for enriching datasets in this way?