Extremely confusing or non-existent documentation about the Seq2Seq trainer

I’ve been trying to train a model to translate database metadata + human requests into valid SQL.

Initially, I used a WikiSQL base + a custom PyTorch script (which worked fine), but then I decided I wanted to train my own model from scratch, and that I’d better go with the “modern” method of using a trainer.

The code I currently have is:

        self.tokenizer = T5Tokenizer.from_pretrained("t5-small")
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small")
        print('Creating datasets')
        train_dataset = Dataset.from_dict({
            'request': [x['prompt'] for x in data[:int(len(data) * 0.8)]],
            'label': [x['completion'] for x in data[:int(len(data) * 0.8)]]
        })
        eval_dataset = Dataset.from_dict({
            'request': [x['prompt'] for x in data[int(len(data) * 0.8):]],
            'label': [x['completion'] for x in data[int(len(data) * 0.8):]]
        })

        print('Creating and starting trainer')
        # Initialize our Trainer
        trainer = Seq2SeqTrainer(
            model=self.model,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=self.tokenizer,
            compute_metrics=self.compute_metrics,
            args=Seq2SeqTrainingArguments(
                output_dir='hft',
                overwrite_output_dir=True,
                do_train=True,
                do_eval=True,
                num_train_epochs=20,
                generation_max_length=512,
            )
        )

        trainer.evaluate()
        trainer.train()
        trainer.evaluate()

Where the prompt and completion keys are both strings.
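For reference, each entry of data looks roughly like this (the values here are hypothetical, just to show the shape):

    # One hypothetical entry of `data`; both values are plain strings
    {
        'prompt': 'tables: singer(id, name, age) | question: How many singers are there?',
        'completion': 'SELECT count(*) FROM singer',
    }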

This simply yields the error:

***** Running Evaluation *****
  Num examples = 20
  Batch size = 8
Traceback (most recent call last):
  File "itg/t5take5.py", line 71, in <module>
    t5t5.train(sparc_to_prompt())
  File "itg/t5take5.py", line 59, in train
    trainer.evaluate()
  File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 70, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2151, in evaluate
    output = eval_loop(
  File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2313, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/george/.local/lib/python3.8/site-packages/transformers/data/data_collator.py", line 246, in __call__
    batch = self.tokenizer.pad(
  File "/home/george/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2723, in pad
    raise ValueError(
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']

This is rather confusing. I’ve tried renaming the label column to something else (SQL), but that just results in training failing and evaluation doing nothing, with the logs:

***** Running Evaluation *****
  Num examples = 0
  Batch size = 8
The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: request, SQL.

I’ve also tried providing the label_names argument, but that doesn’t help either; the same behavior manifests.

Is the trainer simply too new to have proper documentation?

I also tried looking at some examples, but the “hard” part, namely how you actually get a dataset formatted in a valid way, is always missing.

Have you tried following the relevant course sections? (I linked to the translation section, but summarization should be the same as well.)

Basically, you are supplying raw datasets to the Seq2SeqTrainer, and that can’t work: the trainer needs the actual model inputs (input_ids, attention_mask, labels, etc.), so you have to tokenize both your inputs and your targets first. (That is also why you saw Num examples = 0 after renaming the column: by default the Trainer drops any dataset column that doesn’t match the model’s forward() signature, and with only request and SQL present, the dataset ended up empty.) This is all done in the examples you mention once you have your dataset with one column for the input texts and one column for the target texts.
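For concreteness, here is a minimal sketch of that preprocessing step, assuming the request/label column names from your snippet and written inside the same method so it can reach self.tokenizer (the max_length values are placeholders to tune):

    from transformers import DataCollatorForSeq2Seq

    tokenizer = self.tokenizer

    def preprocess(examples):
        # Tokenize the input texts
        model_inputs = tokenizer(
            examples['request'], max_length=512, truncation=True
        )
        # Tokenize the target texts in target mode (on recent versions
        # you can pass text_target= to the tokenizer call instead)
        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                examples['label'], max_length=512, truncation=True
            )
        model_inputs['labels'] = targets['input_ids']
        return model_inputs

    # Map over the raw datasets and drop the string columns, leaving
    # only input_ids / attention_mask / labels
    train_dataset = train_dataset.map(
        preprocess, batched=True, remove_columns=['request', 'label']
    )
    eval_dataset = eval_dataset.map(
        preprocess, batched=True, remove_columns=['request', 'label']
    )

    # Use the seq2seq collator so labels are padded with -100; the
    # default collator calls tokenizer.pad on every column, which is
    # exactly where your ValueError came from
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=self.model)

Then pass data_collator=data_collator when you construct the Seq2SeqTrainer. Note also that generation_max_length only takes effect if you set predict_with_generate=True in Seq2SeqTrainingArguments; without it, evaluation won’t generate sequences for your compute_metrics.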