Fine-tuning an existing Hugging Face model on a new dataset (text-to-SQL task)

Hi,

I have to build a model that converts natural-language questions into SQL queries. For that I am using the existing model "rakeshkiriyath/gpt2Medium_text_to_sql", but it is not giving very accurate results, so I want to fine-tune it on a new dataset of question/SQL pairs. Please suggest whether my approach below is correct.

Content of file.txt

[
  {
    "question": "Which name starts with A",
    "context": "CREATE TABLE Companies (id int,name varchar,address text,email varchar,phone varchar)",
    "answer": "SELECT name from Companies where name like 'A%'"
  },
  {
    "question": "Which name starts with B",
    "context": "CREATE TABLE Companies (id int,name varchar,address text,email varchar,phone varchar)",
    "answer": "SELECT name from Companies where name like 'B%'"
  },
  {
    "question": "Which name starts with C",
    "context": "CREATE TABLE Companies (id int,name varchar,address text,email varchar,phone varchar)",
    "answer": "SELECT name from Companies where name like 'C%'"
  },
  {
    "question": "How many distinct id are present",
    "context": "CREATE TABLE Companies (id int,name varchar,address text,email varchar,phone varchar)",
    "answer": "SELECT COUNT(DISTINCT id) from Companies"
  }
]
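
Since TextDataset reads the training file as raw text, it may help to first flatten each record into a single line that matches the prompt format used at inference time. A rough sketch, assuming the records above are stored as a JSON array in file.txt and that the serialized examples are written to a new train.txt (the training script's train_file would then point at train.txt; the exact prompt wording is an assumption):

import json

# Read the JSON array of {"question", "context", "answer"} records shown above (assumed layout).
with open("file.txt", "r", encoding="utf-8") as f:
    records = json.load(f)

# Write one serialized example per line, using the same prompt prefix as the inference code below.
with open("train.txt", "w", encoding="utf-8") as f:
    for rec in records:
        line = (
            f"Translate the following English question to SQL: {rec['question']} "
            f"Context: {rec['context']} Answer: {rec['answer']}"
        )
        f.write(line + "\n")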

Code to fine-tune the model:

from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Load the pre-trained GPT-2 model and tokenizer
model_name = "rakeshkiriyath/gpt2Medium_text_to_sql"  # adjust based on the model you have
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Load your custom dataset
# Assuming you have a text file with one example per line
train_file = "file.txt"
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_file,
    block_size=128,  # adjust the block size based on your requirements
)

# Create a data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 is a causal LM, so masked language modeling is not used
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=100,
    save_total_limit=2,
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Train the model

trainer.train()

# Save the fine-tuned model
model.save_pretrained("your_fine_tuned_model_directory")
tokenizer.save_pretrained("your_fine_tuned_model_directory")
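
As a side note, TextDataset is deprecated in recent transformers releases; an equivalent approach with the datasets library is sketched below. It reuses the model, tokenizer, train_file, training_args, and data_collator defined above; the "text" loader, the max_length of 128, and the pad-token handling are assumptions:

from datasets import load_dataset

# GPT-2 tokenizers ship without a pad token; reuse EOS so the collator can pad variable-length lines.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Each line of the training file becomes one record with a "text" column.
raw = load_dataset("text", data_files={"train": train_file})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Tokenize and drop the raw column so the collator only receives input_ids/attention_mask.
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized["train"],
)
trainer.train()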

Once the model is fine-tuned and saved, we can use it for inference:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_path = "your_fine_tuned_model_directory"
finetunedGPT = GPT2LMHeadModel.from_pretrained(model_path)
finetunedTokenizer = GPT2Tokenizer.from_pretrained(model_path)
def generate_text_to_sql(query, model, tokenizer, max_length=256):
    prompt = f"Translate the following English question to SQL: {query}"
    input_tensor = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_tensor, max_length=max_length, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
    decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
    # Return only the SQL part (removing the input prompt)
    sql_output = decoded_output[len(prompt):].strip()
    return sql_output

queryList = ["Which name starts with C"]
for query in queryList:
    sql_result = generate_text_to_sql(query, finetunedGPT, finetunedTokenizer)
    print(sql_result, "\n")