Could it be that `remove_unused_columns=False` is specified?
If something is not working properly, I think it is safer to define and pass your own `DataCollator`. It takes a bit of effort, but…
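As an illustration, here is a minimal sketch of such a collator for causal-LM fine-tuning; the class name is made up, and it assumes each example already carries `input_ids` and `attention_mask`:

```python
from dataclasses import dataclass

from transformers import PreTrainedTokenizerBase


@dataclass
class CausalLMCollator:
    """Pads a batch and copies input_ids into labels for the causal-LM loss."""

    tokenizer: PreTrainedTokenizerBase

    def __call__(self, features):
        # Pad only the keys the model expects; extra dataset columns are ignored.
        batch = self.tokenizer.pad(
            [{"input_ids": f["input_ids"], "attention_mask": f["attention_mask"]}
             for f in features],
            padding=True,
            return_tensors="pt",
        )
        # Labels are the input ids; mask padding with -100 so the loss skips it.
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100
        batch["labels"] = labels
        return batch
```

An instance of this goes to the `Trainer` via `data_collator=CausalLMCollator(tokenizer)`.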
Hi there!
Currently, columns not used by the model are removed in `self.get_*_dataloader()` upon data loader creation, but one might want to have them in `compute_metrics` (when `include_inputs_for_metrics=True`).
My case is fine-tuning on prompt-completion pairs, and I use the tokenizer's `token_type_ids` as a mask to compute accuracy only on the completion tokens.
To this end, the best way I've come up with is to keep that column in the dataset and data loader with `remove_unused_columns=False`, and then remove it in an overridden `self._prepare_inputs()`.
Is there a better way to achieve this? More generally, wouldn't it be better to move the column-removal logic into `self._prepare_inputs`, since it only serves as a gatekeeper for `model(**inputs)`?
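For reference, a minimal sketch of the override described above, assuming the extra column is named `token_type_ids` (the subclass name is hypothetical):

```python
from transformers import Trainer


class CompletionMaskTrainer(Trainer):
    def _prepare_inputs(self, inputs):
        inputs = super()._prepare_inputs(inputs)
        # The mask column survives batching because remove_unused_columns=False;
        # drop it here so it never reaches model(**inputs).
        inputs.pop("token_type_ids", None)
        return inputs
```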
You can avoid this error by passing `remove_unused_columns=False` to `TrainingArguments`, but a cleaner solution is to use `map` to tokenize the dataset before passing it to the `Trainer` (instead of tokenizing lazily).
After this change, you should get the "The model did not return a loss from the inputs …" error, which you can fix by returning a `labels` column from the collate function (a copy of `input_ids` for causal LM).
(`DataCollatorForLanguageModeling` handles this automatically, so it's better to perform the tokenization with `map` and use it as the collator.)
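Putting that answer together, a sketch of the flow, assuming a causal LM and a plain text file (the model name and file path are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

# Tokenize up front with map() so the Trainer sees ready-made model columns.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=["text"],
)

# mlm=False makes the collator copy input_ids into labels (padding masked to -100),
# so the model computes and returns a loss.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```

`tokenized` and `collator` are then passed to the `Trainer` as usual.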