label2id not working

I recently started building a text classification pipeline using this tutorial:

I converted my own data to a dataset with one column called text and one column called label.

I then followed the tutorial, but I got this error:

"""

Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (label in this case) have excessive nesting (inputs type list where type int is expected).

"""

I thought the id2label and label2id parameters of the model would take care of this, but they didn't, so I added a line to my batch tokenizer function to convert the labels to int.
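For illustration, a minimal sketch of that kind of conversion (not the exact code from my pipeline). The label names and the `encode_labels` helper are hypothetical; the function takes a batch as a dict of lists, which is the shape a batched `Dataset.map` call passes in.

```python
# Hypothetical label names; replace with the classes in your own data.
LABEL_NAMES = ["negative", "positive"]

# The same two mappings that get passed to the model as label2id / id2label.
label2id = {name: i for i, name in enumerate(LABEL_NAMES)}
id2label = {i: name for name, i in label2id.items()}


def encode_labels(batch):
    """Return a copy of the batch (a dict of lists) with string labels as int ids."""
    return dict(batch, label=[label2id[name] for name in batch["label"]])


batch = {"text": ["great movie", "terrible plot"], "label": ["positive", "negative"]}
print(encode_labels(batch)["label"])  # [1, 0]
```

With the datasets library this would be applied as `dataset.map(encode_labels, batched=True)`, or folded into the tokenizing function. As far as I can tell, id2label and label2id on the model only populate its config so predictions can be shown with readable names; they do not convert the label column of the dataset.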

Now my tokenized dataset has one column called text, one called label, one called input_ids, one called attention_masks, and one more column.

My question is: which columns does the Trainer use to train and validate the text classification pipeline? Should I remove all the other columns from my tokenized dataset?


From reading the troubleshooting docs, the Trainer ignores any columns that it doesn't use, such as untokenized string fields, so you should not need to remove them yourself.
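As a rough sketch of how I understand that filtering to work: with the default `remove_unused_columns=True` in `TrainingArguments`, the Trainer keeps the columns whose names match arguments of the model's forward method, plus the label columns. The `forward` stub and the `columns_kept` helper below are hypothetical stand-ins, not the library's own code.

```python
import inspect


def forward(input_ids=None, attention_mask=None, labels=None):
    """Stand-in for a sequence classification model's forward method."""


def columns_kept(dataset_columns, forward_fn, label_names=("label", "label_ids")):
    """List the dataset columns that match forward() arguments or label names."""
    accepted = set(inspect.signature(forward_fn).parameters) | set(label_names)
    return [col for col in dataset_columns if col in accepted]


cols = ["text", "label", "input_ids", "attention_mask"]
print(columns_kept(cols, forward))  # ['label', 'input_ids', 'attention_mask']
```

In this sketch the raw text column is dropped while label, input_ids and attention_mask are passed through, which matches what the docs describe.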

Could you share the tokenizer code you use to convert string labels to int? I am also having an issue where the tutorial code is broken and the Trainer seems to ignore id2label and label2id.
