I have a dataset object created from a pandas DataFrame (no padding at this point, because I want to pad later inside the collator):
df = pd.read_csv(file, dtype=object, header=None)
dataset_train = datasets.Dataset.from_pandas(df, preserve_index=False)
dataset_train = dataset_train.map(
    lambda examples: tokenizer(examples['feature1'], padding='longest', max_length=2048),
    batched=True, batch_size=32)
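For completeness, this is how I sanity-check the tokenized dataset (just an inspection snippet; 'input_ids' is one of the columns added by the tokenizer output):
# Sanity check: list the dataset columns and a few tokenized sequence lengths
print(dataset_train.column_names)
print([len(ids) for ids in dataset_train[:3]['input_ids']])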
I then use the following data collator for masked language modeling:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2, return_tensors='pt')
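The Trainer setup is roughly the following (the model checkpoint, output directory, and epoch count below are placeholders rather than my exact values; the relevant part is that the tokenized dataset and the collator above are passed in):
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

# Placeholder model and arguments; dataset_train and data_collator are defined above
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
training_args = TrainingArguments(output_dir='mlm_out',
                                  per_device_train_batch_size=32,
                                  num_train_epochs=1)
trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=dataset_train,
                  data_collator=data_collator)
trainer.train()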
I am getting the following error while running the Trainer routine:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
But I am already padding the input batch-wise and using the same batch size during training. What am I doing wrong here?