You can avoid this error by passing `remove_unused_columns=False` to `TrainingArguments`, but a cleaner solution is to use `map` to tokenize the dataset before passing it to the `Trainer` (instead of tokenizing lazily).
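For example, a minimal sketch of that eager tokenization step (the dataset and model names here are hypothetical placeholders, assuming a dataset with a `text` column):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical dataset and checkpoint, purely for illustration.
raw_dataset = load_dataset("imdb", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Tokenize eagerly with map and drop the raw columns, so the Trainer
# only sees columns that the model's forward() accepts.
tokenized_dataset = raw_dataset.map(
    tokenize, batched=True, remove_columns=raw_dataset.column_names
)
```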
After this change, you should get the “The model did not return a loss from the inputs …” error, which you can fix by returning a `labels` column in the collate function (equal to `input_ids`).
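One way to write such a collate function (a minimal sketch for causal language modeling, reusing the `tokenizer` from the snippet above; the model computes the loss internally by shifting the labels):

```python
def collate_fn(examples):
    # Pad the variable-length tokenized examples into a batch of tensors.
    batch = tokenizer.pad(examples, padding=True, return_tensors="pt")
    # For causal LM, labels are simply a copy of input_ids. A refinement
    # would mask padding positions to -100 so they are ignored by the loss,
    # which is what the built-in collator mentioned below does.
    batch["labels"] = batch["input_ids"].clone()
    return batch
```

You would then pass this function as the `data_collator` argument of the `Trainer`.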
(`DataCollatorForLanguageModeling` handles this automatically, so it’s better to perform the tokenization in `map` and then use this collator as the `data_collator`, as explained in our NLP course.)
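A sketch of that setup, continuing from the snippets above (with `mlm=False` for causal language modeling; set `mlm=True` instead for masked LM; the model checkpoint and output directory are placeholders):

```python
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")

# mlm=False -> causal LM: the collator pads the batch and sets labels to
# a copy of input_ids, with padding positions masked to -100.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),  # hypothetical output dir
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
```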