Extremely confusing or non-existent documentation about the Seq2Seq trainer

I’ve been trying to train a model to translate database metadata + human requests into valid SQL.

Initially, I used a WikiSQL base + a custom PyTorch script (which worked fine), but then I decided I wanted to train my own model from scratch, and that I’d better go with the “modern” method of using a trainer.

The code I currently have is:

        self.tokenizer = T5Tokenizer.from_pretrained("t5-small")
        self.model = T5ForConditionalGeneration.from_pretrained("t5-small")
        print('Creating datasets')
        train_dataset = Dataset.from_dict({
            'request': [x['prompt'] for x in data[:int(len(data) * 0.8)]],
            'label': [x['completion'] for x in data[:int(len(data) * 0.8)]]
        })
        eval_dataset = Dataset.from_dict({
            'request': [x['prompt'] for x in data[int(len(data) * 0.8):]],
            'label': [x['completion'] for x in data[int(len(data) * 0.8):]]
        })

        print('Creating and starting trainer')
        # Initialize our Trainer
        trainer = Seq2SeqTrainer(
            model=self.model,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            tokenizer=self.tokenizer,
            compute_metrics=self.compute_metrics,
            args=Seq2SeqTrainingArguments(
                output_dir='hft',
                overwrite_output_dir=True,
                do_train=True,
                do_eval=True,
                num_train_epochs=20,
                generation_max_length=512,
            )
        )

        trainer.evaluate()
        trainer.train()
        trainer.evaluate()

Where the prompt and completion keys are both strings.
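For reference, each entry of data looks roughly like this (the values here are hypothetical, just to show the shape):

    # One hypothetical entry of `data`; both values are plain strings
    {
        'prompt': 'tables: singer(id, name, age) | question: How many singers are there?',
        'completion': 'SELECT count(*) FROM singer',
    }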

This simply yields the error:

***** Running Evaluation *****
  Num examples = 20
  Batch size = 8
Traceback (most recent call last):
  File "itg/t5take5.py", line 71, in <module>
    t5t5.train(sparc_to_prompt())
  File "itg/t5take5.py", line 59, in train
    trainer.evaluate()
  File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 70, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2151, in evaluate
    output = eval_loop(
  File "/home/george/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2313, in evaluation_loop
    for step, inputs in enumerate(dataloader):
  File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/george/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/george/.local/lib/python3.8/site-packages/transformers/data/data_collator.py", line 246, in __call__
    batch = self.tokenizer.pad(
  File "/home/george/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2723, in pad
    raise ValueError(
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']

This is rather confusing. I’ve tried renaming the label column to something else (SQL), but that just results in training failing and evaluation doing nothing, with the logs:

***** Running Evaluation *****
  Num examples = 0
  Batch size = 8
The following columns in the training set  don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: request, SQL.

I’ve also tried providing the label_names argument, but that doesn’t help either; the same behavior manifests.

Is the trainer simply too new to have proper documentation?

I also tried looking at some examples, but the “hard” part, namely how you actually get a dataset formatted in a valid way, is always missing.

Have you tried following the relevant course sections? (I linked to the translation section, but summarization should be the same as well.)

Basically, you are supplying raw datasets to the Seq2SeqTrainer, and that can’t work: the trainer needs the actual model inputs (input_ids, attention_mask, labels, etc.), so you have to tokenize both your inputs and your targets first. (That is also why you saw Num examples = 0 after renaming the column: by default the Trainer drops any dataset column that doesn’t match the model’s forward() signature, and with only request and SQL present, the dataset ended up empty.) This is all done in the examples you mention once you have your dataset with one column for the input texts and one column for the target texts.
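For concreteness, here is a minimal sketch of that preprocessing step, assuming the request/label column names from your snippet and written inside the same method so it can reach self.tokenizer (the max_length values are placeholders to tune):

    from transformers import DataCollatorForSeq2Seq

    tokenizer = self.tokenizer

    def preprocess(examples):
        # Tokenize the input texts
        model_inputs = tokenizer(
            examples['request'], max_length=512, truncation=True
        )
        # Tokenize the target texts in target mode (on recent versions
        # you can pass text_target= to the tokenizer call instead)
        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                examples['label'], max_length=512, truncation=True
            )
        model_inputs['labels'] = targets['input_ids']
        return model_inputs

    # Map over the raw datasets and drop the string columns, leaving
    # only input_ids / attention_mask / labels
    train_dataset = train_dataset.map(
        preprocess, batched=True, remove_columns=['request', 'label']
    )
    eval_dataset = eval_dataset.map(
        preprocess, batched=True, remove_columns=['request', 'label']
    )

    # Use the seq2seq collator so labels are padded with -100; the
    # default collator calls tokenizer.pad on every column, which is
    # exactly where your ValueError came from
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=self.model)

Then pass data_collator=data_collator when you construct the Seq2SeqTrainer. Note also that generation_max_length only takes effect if you set predict_with_generate=True in Seq2SeqTrainingArguments; without it, evaluation won’t generate sequences for your compute_metrics.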