Hi all, I would like to pass my custom dataset to the Trainer for text classification. I have read an example of how to pass one of the pre-packaged datasets to the Trainer, but what I don’t understand is: what should the columns holding the input_ids and the labels be named after tokenization? Also, before tokenization, what should the columns holding the text and labels be named? Thanks a lot
hey @rahmanoladi, in general you can have whatever column names you want for the text and labels before tokenization - it’s up to you to decide how the text should be processed.
once you’ve tokenized the text, you shouldn’t need to rename the resulting columns like input_ids and attention_mask (and i wouldn’t recommend this since it will probably break the Trainer logic).
by default, the Trainer looks for the label column name labels, but you can override this by specifying the value of TrainingArguments.label_names: Trainer — transformers 4.5.0.dev0 documentation
Hey @lewtun, so why is there no column named “labels” in these two tutorials (1, 2) (there is a column named “label” instead), and no label_names setting in the training arguments either? How does the Trainer know the label column is “label”? Thanks
There is a pattern for the label column in the Trainer: as long as your label column name is prefixed with label (e.g. label_name, labels, label_ids, etc.), the Trainer will know which one is the label column.
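One concrete piece of this, as far as I can tell from the library, is the data collator: default_data_collator copies a label (or label_ids) field into labels when it builds a batch, and DataCollatorWithPadding appears to do the same rename. The snippet below is a small check of that behavior under the assumption that transformers and numpy are installed; the token ids are made up. It only covers the names label and label_ids, so it does not confirm that every name starting with label is handled.

```python
from transformers import default_data_collator

# Two already-tokenized examples whose label column is called "label".
# The token ids are made up for illustration.
features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1], "label": 1},
    {"input_ids": [101, 2008, 102], "attention_mask": [1, 1, 1], "label": 0},
]

# return_tensors="np" keeps the example free of a PyTorch dependency.
batch = default_data_collator(features, return_tensors="np")

# Inspect which keys the model would actually receive.
batch_keys = sorted(batch.keys())
print(batch_keys)
```

If the printed keys include labels rather than label, that explains why the tutorials work without renaming the column or setting label_names.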