ValueError in using DataCollator: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length

I have a Dataset object created from a pandas DataFrame (no padding at this point, because I want to pad later inside the collator):

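import pandas as pd
import datasets

# Load the raw CSV; every column is read as a Python object (string)
df = pd.read_csv(file, dtype=object, header=None)
dataset_train = datasets.Dataset.from_pandas(df, preserve_index=False)

# Tokenize the text column in batches of 32 examples
dataset_train = dataset_train.map(
    lambda examples: tokenizer(examples['feature1'], padding='longest', max_length=2048),
    batched=True,
    batch_size=32,
)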

I then set up the data collator for masked language modeling as follows:

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                                mlm_probability=0.2, return_tensors='pt')

I get the following error when running the Trainer:

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

But I am already padding the input batch-wise and using the same batch while training. What am I doing wrong here?

Did you tokenize your labels? This error usually occurs when you forget to tokenize them. If you haven't, you can check this post.
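One way to narrow this down is to look at a single example before it reaches the collator. The error message complains about a `list` where an `int` is expected, so the rough sketch below flags any column that is not a plain int or a flat list of ints. The helper name `find_non_flat_features` and the sample values are made up for illustration; they are not part of transformers or datasets.

```python
from typing import Any, Dict, Optional


def describe_problem(value: Any) -> Optional[str]:
    """Return a short description if `value` is not an int or a flat list of ints."""
    if isinstance(value, list):
        if any(isinstance(item, list) for item in value):
            return "nested list"
        if all(isinstance(item, int) for item in value):
            return None  # flat list of ints: fine
        return "list with non-int items"
    if isinstance(value, int):
        return None  # scalar int: fine
    return f"{type(value).__name__} value"


def find_non_flat_features(example: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: map each suspicious column name to what is wrong with it."""
    report = {}
    for name, value in example.items():
        problem = describe_problem(value)
        if problem is not None:
            report[name] = problem
    return report


# Made-up example row: `labels` has one extra level of nesting and the
# raw text column `feature1` is still present next to the token ids.
example = {
    "input_ids": [101, 7592, 102],
    "attention_mask": [1, 1, 1],
    "labels": [[101, 7592, 102]],
    "feature1": "hello",
}

report = find_non_flat_features(example)
print(report)
```

With a real dataset, `dataset_train[0]` returns one row as a dict, so `find_non_flat_features(dataset_train[0])` should work the same way. Any flagged columns can be dropped with `dataset_train.remove_columns([...])` before training. With `mlm=True` the collator builds `labels` from `input_ids` itself, so a pre-existing `labels` column is probably worth a close look here.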
