Column names of custom dataset for use with trainer

Hi all, I would like to pass my custom dataset to the Trainer for text classification. I have read an example of how to pass one of the pre-packaged datasets to the Trainer, but what I don’t understand is: what should the columns holding the input_ids and the labels be named after tokenization? Also, before tokenization, what should the columns holding the text and the labels be named? Thanks a lot

hey @rahmanoladi, in general you can have whatever column names you want for the text and labels before tokenization - it’s up to you to decide how the text should be processed.

once you’ve tokenized the text, you shouldn’t need to rename the resulting columns like input_ids and attention_mask (and i wouldn’t recommend this since it will probably break the Trainer logic).
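to make that concrete, here is a small plain-Python sketch of the idea. `toy_tokenizer`, `VOCAB` and the column names `review` and `sentiment` are all made up for illustration; a real tokenizer has a proper vocabulary, but it returns the same `input_ids` and `attention_mask` keys.

```python
# Hypothetical stand-in for a real tokenizer: it only mimics the output keys.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "great": 2, "movie": 3, "boring": 4}


def toy_tokenizer(text, max_length=4):
    """Map whitespace-separated words to ids and pad to max_length."""
    words = text.lower().split()[:max_length]
    ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in words]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
    }


# Column names before tokenization are arbitrary: here "review" and "sentiment".
rows = [
    {"review": "Great movie", "sentiment": 1},
    {"review": "Boring", "sentiment": 0},
]

# Roughly what mapping a tokenize function over a dataset does:
# the tokenizer output is added to each row as new columns.
tokenized = [{**row, **toy_tokenizer(row["review"])} for row in rows]
```

the text column can be called anything because you are the one who reads it inside the tokenize function; the tokenizer output names are the ones to leave alone.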

by default, the Trainer looks for a label column named labels, but you can override this by setting TrainingArguments.label_names (see the Trainer page of the transformers 4.5.0.dev0 documentation).
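to illustrate what label_names controls, here is a plain-Python sketch (not the actual Trainer code) of pulling the labels out of a batch by name. `split_labels` and the toy batches are made up for illustration.

```python
def split_labels(batch, label_names=("labels",)):
    """Return the values stored under label_names, or None if any is missing."""
    if all(batch.get(name) is not None for name in label_names):
        return tuple(batch[name] for name in label_names)
    return None


# Classification-style batch: the default name "labels" is found.
clf_batch = {"input_ids": [[2, 3]], "attention_mask": [[1, 1]], "labels": [1]}
clf_labels = split_labels(clf_batch)

# Question-answering-style batch: two label keys, so label_names is overridden.
qa_batch = {"input_ids": [[2, 3]], "start_positions": [0], "end_positions": [1]}
qa_labels = split_labels(qa_batch, label_names=("start_positions", "end_positions"))

# With the default name, the QA batch looks like it has no labels at all.
qa_default = split_labels(qa_batch)
```

as far as I understand, the names in label_names are also passed to the model as keyword arguments, so for a standard classification model it is usually simpler to rename your column to labels than to change label_names.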


Hey @lewtun, so why is there no column named “labels” in these two tutorials (1, 2)? There is a column named “label” instead, and label_names is not set in the training arguments either. How does the Trainer know that the label column is “label”? Thanks

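The Trainer setup recognizes a few label column names besides labels. As far as I can tell, the default data collators rename a label or label_ids column to labels when they build a batch, so a dataset with a label column works without setting label_names. A column with any other name (e.g. sentiment) is not picked up automatically.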
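For what it's worth, my understanding is that the label / label_ids handling lives in the data collator, which copies those columns into labels when it builds a batch. Here is a plain-Python sketch of that step. It is a simplified stand-in, not the library code: the real collators also convert everything to tensors, and `collate_with_label_rename` is a made-up name.

```python
def collate_with_label_rename(features):
    """Build a batch from a list of rows, storing label/label_ids under "labels"."""
    first = features[0]
    batch = {}

    # "label" and "label_ids" are special-cased and end up under "labels".
    if first.get("label") is not None:
        batch["labels"] = [row["label"] for row in features]
    elif first.get("label_ids") is not None:
        batch["labels"] = [row["label_ids"] for row in features]

    # Every other column keeps its name.
    for key in first:
        if key not in ("label", "label_ids"):
            batch[key] = [row[key] for row in features]
    return batch


features = [
    {"input_ids": [2, 3], "attention_mask": [1, 1], "label": 1},
    {"input_ids": [4, 0], "attention_mask": [1, 0], "label": 0},
]
batch = collate_with_label_rename(features)
```

After this step the batch has a labels key and no label key, which would explain why the tutorials work with a label column and no extra setting.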