Hi @AlanFeder, judging by the stack trace my first guess is that the problem comes from a conflict between padding in the `dataset.map` operation vs padding on-the-fly in the `Trainer`.
As described in the `Trainer` docs, when you pass the tokenizer to the `Trainer` it will be used as follows:

> The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
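In other words, when the `Trainer` gets a tokenizer it pads each batch dynamically at training time, roughly as if you had supplied a `DataCollatorWithPadding` yourself. A quick sketch of the equivalence (here `tokenizer`, `model`, `training_args`, and `train_dataset` just stand in for your own objects):

```python
from transformers import DataCollatorWithPadding, Trainer

# Roughly what the Trainer does internally when you pass tokenizer=tokenizer:
# each batch is padded to the length of its longest sequence, not up front.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,  # equivalent to passing tokenizer=tokenizer
)
```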
So it seems that in your code, you’re doing padding twice: once in `dataset.map` and then again during training.
Can you remove the `padding=True` argument from your tokenization step and see if that works?
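For reference, here’s a minimal end-to-end sketch of the pattern I mean (the model checkpoint, dataset, and column names are just placeholders for illustration, so adapt them to your setup):

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint and dataset -- swap in your own.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
raw_datasets = load_dataset("imdb")

def tokenize_function(examples):
    # Note: truncation only, no padding=True here --
    # the Trainer will pad each batch on the fly.
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,  # dynamic padding happens here, per batch
)
trainer.train()
```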