I’ve been trying to train a model to translate database metadata + human requests into valid SQL.
Initially, I used a WikiSQL base + a custom PyTorch script (which worked fine), but I decided I wanted to train my own from scratch, and figured I’d better go with the “modern” method of using a trainer.
The code I currently have is:
self.tokenizer = T5Tokenizer.from_pretrained("t5-small")
self.model = T5ForConditionalGeneration.from_pretrained("t5-small")
print('Creating datasets')
train_dataset = Dataset.from_dict({
    'request': [x['prompt'] for x in data[:int(len(data) * 0.8)]],
    'label': [x['completion'] for x in data[:int(len(data) * 0.8)]]
})
eval_dataset = Dataset.from_dict({
    'request': [x['prompt'] for x in data[int(len(data) * 0.8):]],
    'label': [x['completion'] for x in data[int(len(data) * 0.8):]]
})
print('Creating and starting trainer')
# Initialize our Trainer
trainer = Seq2SeqTrainer(
    model=self.model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=self.tokenizer,
    compute_metrics=self.compute_metrics,
    args=Seq2SeqTrainingArguments(
        output_dir='hft',
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        num_train_epochs=20,
        generation_max_length=512,
    )
)
trainer.evaluate()
trainer.train()
trainer.evaluate()
Where the prompt and completion keys are both strings.
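For reference, each element of data is just a plain dict with those two string fields, along these lines (values made up for illustration):

{
    'prompt': 'tables: stadium(id, name, capacity) | question: how many stadiums are there?',
    'completion': 'SELECT count(*) FROM stadium'
}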
This simply yields the error:
***** Running Evaluation *****
Num examples = 20
Batch size = 8
Traceback (most recent call last):
File "itg/t5take5.py", line 71, in <module>
t5t5.train(sparc_to_prompt())
File "itg/t5take5.py", line 59, in train
trainer.evaluate()
File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 70, in evaluate
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2151, in evaluate
output = eval_loop(
File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2313, in evaluation_loop
for step, inputs in enumerate(dataloader):
File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/george/.local/lib/python3.8/site-packages/transformers/data/data_collator.py", line 246, in __call__
batch = self.tokenizer.pad(
File "/home/george/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2723, in pad
raise ValueError(
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']
This is rather confusing. I’ve tried renaming the label column to something else (SQL), but this just results in training failing and evaluation doing nothing, with the logs:
***** Running Evaluation *****
Num examples = 0
Batch size = 8
The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: request, SQL.
I’ve also tried providing the label_names argument, but this doesn’t help either and the same behavior manifests.
Is the trainer’s documentation new?
I also tried looking at some examples, but the “hard” part, namely how you actually get a dataset formatted in a valid way, is always missing.
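My current best guess is that the datasets have to be tokenized into input_ids / labels before being handed to the trainer, roughly like the (untested) sketch below, and that the default collator should be swapped for DataCollatorForSeq2Seq so the labels get padded, but I haven’t found anything that confirms this:

from transformers import DataCollatorForSeq2Seq

def preprocess(batch):
    # Tokenize the natural-language request into input_ids / attention_mask
    model_inputs = self.tokenizer(batch['request'], max_length=512, truncation=True)
    # Tokenize the target SQL and attach it as the labels the model expects
    targets = self.tokenizer(batch['label'], max_length=512, truncation=True)
    model_inputs['labels'] = targets['input_ids']
    return model_inputs

train_dataset = train_dataset.map(preprocess, batched=True, remove_columns=['request', 'label'])
eval_dataset = eval_dataset.map(preprocess, batched=True, remove_columns=['request', 'label'])
# Pads inputs and labels together (labels padded with -100 so they're ignored by the loss),
# and would then be passed to Seq2SeqTrainer via data_collator=...
data_collator = DataCollatorForSeq2Seq(self.tokenizer, model=self.model)

Is something like that what the trainer expects, or is it supposed to handle the tokenization itself?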