Dear HuggingFace Community,
I am interested in exploring how large language models (LLMs) understand session-based, sequential data in the context of SQL queries. Most existing datasets provide only a single query paired with its label, and I would like to investigate whether public datasets can be augmented to better capture sequential interactions.
For example, consider the SQLShield dataset, which follows the structure below:
question | query | context | malicious |
---|---|---|---|
What is the description of the claim status “Open”? | SELECT claim_status_description FROM claims_processing_stages WHERE claim_status_name = 'Open'; | CREATE TABLE claims_processing_stages ( claim_status_description VARCHAR, claim_status_name VARCHAR ); | 0 |
SELECT AVG(Snatch) FROM body_builder | SELECT AVG(Snatch) FROM body_builder; | CREATE TABLE body_builder ( Snatch INTEGER ); | 0 |
My idea is to give an LLM a random sample of these records and, through carefully designed prompts, ask it to generate a sequence of related queries that represents a session. However, I could not find any similar research, blog posts, or other write-ups. If you have seen one, I would be grateful if you could share it with me.
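To make the idea concrete, here is a minimal sketch of the sampling and prompt-building step. It is only an illustration under my own assumptions: the function names `sample_seed_records` and `build_session_prompt`, the prompt wording, and the default session length are all hypothetical, and the two records are copied by hand from the table above rather than loaded from the real dataset. The call to an actual LLM is left out on purpose.

```python
import random

# Two records copied by hand from the table above.
# A real run would load the full dataset instead (not shown here).
RECORDS = [
    {
        "question": 'What is the description of the claim status "Open"?',
        "query": "SELECT claim_status_description FROM claims_processing_stages "
                 "WHERE claim_status_name = 'Open';",
        "context": "CREATE TABLE claims_processing_stages ( "
                   "claim_status_description VARCHAR, claim_status_name VARCHAR );",
        "malicious": 0,
    },
    {
        "question": "SELECT AVG(Snatch) FROM body_builder",
        "query": "SELECT AVG(Snatch) FROM body_builder;",
        "context": "CREATE TABLE body_builder ( Snatch INTEGER );",
        "malicious": 0,
    },
]


def sample_seed_records(records, k, seed=None):
    """Pick k records at random; a fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(records, k)


def build_session_prompt(seeds, session_length=5):
    """Build a plain-text prompt asking an LLM to extend seed records into a session.

    The wording below is a placeholder (an assumption), not a tested prompt.
    """
    parts = [
        "You are given example SQL records, each with a question, a query, "
        "the table schema (context), and a malicious label.",
        "",
    ]
    for i, rec in enumerate(seeds, start=1):
        parts.append(f"Record {i}:")
        parts.append(f"  question: {rec['question']}")
        parts.append(f"  query: {rec['query']}")
        parts.append(f"  context: {rec['context']}")
        parts.append(f"  malicious: {rec['malicious']}")
    parts.append("")
    parts.append(
        f"Pick one record and write a session of {session_length} related queries "
        "that the same user could plausibly issue in order against that schema. "
        "Return one query per line, each followed by its malicious label."
    )
    return "\n".join(parts)


seeds = sample_seed_records(RECORDS, k=2, seed=0)
prompt = build_session_prompt(seeds, session_length=4)
print(prompt)
```

The resulting `prompt` string would then be sent to whichever model is being evaluated. Keeping the sampling seeded should make it easier to compare sessions generated by different prompts from the same seed records.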
I’d love to hear your thoughts:
- Do you see this as a valuable contribution?
- Do you have other suggestions or approaches for enriching datasets in this way?