ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length

Hello everyone!

I’m trying to execute the code from the summarization task of Chapter 7 of the Hugging Face Course. I’m running it in a Jupyter notebook. But when I execute these lines:

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

I get this error:

You’re using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.


ValueError Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    715 if not is_tensor(value):
--> 716     tensor = as_tensor(value)
    717

ValueError: expected sequence of length 7 at dim 1 (got 6)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
/tmp/ipykernel_615/1644402533.py in <module>
      1 features = [tokenized_datasets["train"][i] for i in range(2)]
----> 2 data_collator(features)

/opt/conda/lib/python3.9/site-packages/transformers/data/data_collator.py in __call__(self, features, return_tensors)
    584 feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)
    585
--> 586 features = self.tokenizer.pad(
    587     features,
    588     padding=self.padding,

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2979 batch_outputs[key].append(value)
   2980
-> 2981 return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2982
   2983 def create_token_type_ids_from_sequences(

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    204 self._n_sequences = n_sequences
    205
--> 206 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    207
    208 @property

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    730 "Please see if a fast version of this tokenizer is available to have this feature available."
    731 )
--> 732 raise ValueError(
    733 "Unable to create tensor, you should probably activate truncation and/or padding with"
    734 " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (labels_mask in this case) have excessive nesting (inputs type list where type int is expected).

Unfortunately, I do not know where to look further for a solution. Does anybody know how to solve this? Thank you very much in advance!


You have this preprocess function, right?

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(examples["review_title"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    model_inputs["labels_mask"] = labels["attention_mask"]
    return model_inputs

All you have to do is remove this line:
model_inputs["labels_mask"] = labels["attention_mask"]

So you end up with:

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(examples["review_title"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

That is indeed the fix, but can we get an explanation of why this is the issue? There are two attention masks: one from running the tokenizer on examples["review_body"], and one from running it on examples["review_title"]. The first one is not a problem, but the second one is, and I cannot understand why.
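My understanding (worth double-checking against the transformers source): `DataCollatorForSeq2Seq` pads `labels` itself (with `-100`), then hands everything to `tokenizer.pad`, which pads only the keys it recognizes, such as `input_ids` and `attention_mask`. An extra key like `labels_mask` is passed through untouched, so its rows stay ragged and the final tensor conversion blows up on exactly that key. The first mask is fine because `attention_mask` is a key the padder knows. Here is a minimal stand-in sketch (not the actual transformers code) of that behavior:

```python
# Stand-in sketch of DataCollatorForSeq2Seq + tokenizer.pad (NOT the real source):
# only known keys get padded; unknown keys survive ragged and fail tensor conversion.

def pad_batch(features, pad_id=0, label_pad_id=-100):
    """Pad a list of feature dicts the way the collator pipeline does."""
    known = {
        "input_ids": pad_id,
        "attention_mask": 0,
        "labels": label_pad_id,  # the collator pads labels itself, with -100
    }
    batch = {}
    for key in features[0]:
        rows = [list(f[key]) for f in features]
        if key in known:  # pad only the keys the pipeline knows about
            width = max(len(r) for r in rows)
            rows = [r + [known[key]] * (width - len(r)) for r in rows]
        batch[key] = rows
    # "Tensor conversion": every key must now be rectangular.
    for key, rows in batch.items():
        if len({len(r) for r in rows}) > 1:
            raise ValueError(
                f"Unable to create tensor ... ({key} in this case) is ragged"
            )
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [1, 2], "labels_mask": [1, 1]},
    {"input_ids": [5, 6], "labels": [1, 2, 3], "labels_mask": [1, 1, 1]},
]

try:
    pad_batch(features)
except ValueError as e:
    print(e)  # labels_mask is the only key left ragged

del features[0]["labels_mask"], features[1]["labels_mask"]
batch = pad_batch(features)
print(batch["labels"])  # [[1, 2, -100], [1, 2, 3]]
```

That is also why dropping `labels_mask` loses nothing: label positions padded with `-100` are already ignored by the loss, so a separate mask for the titles is redundant.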
