ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length

Hello everyone!

I’m trying to execute the code from the summarization task of Chapter 7 of the Hugging Face Course. I’m running it in a Jupyter notebook. But when I execute these lines:

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

I get this error:

You’re using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.


ValueError Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    715 if not is_tensor(value):
--> 716     tensor = as_tensor(value)
    717

ValueError: expected sequence of length 7 at dim 1 (got 6)

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
/tmp/ipykernel_615/1644402533.py in <module>
      1 features = [tokenized_datasets["train"][i] for i in range(2)]
----> 2 data_collator(features)

/opt/conda/lib/python3.9/site-packages/transformers/data/data_collator.py in __call__(self, features, return_tensors)
    584 feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)
    585
--> 586 features = self.tokenizer.pad(
    587     features,
    588     padding=self.padding,

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2979 batch_outputs[key].append(value)
   2980
-> 2981 return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2982
   2983 def create_token_type_ids_from_sequences(

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    204 self._n_sequences = n_sequences
    205
--> 206 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    207
    208 @property

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    730 "Please see if a fast version of this tokenizer is available to have this feature available."
    731 )
--> 732 raise ValueError(
    733 "Unable to create tensor, you should probably activate truncation and/or padding with"
    734 " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (labels_mask in this case) have excessive nesting (inputs type list where type int is expected).

Unfortunately, I do not know where to look further for a solution. Does anybody know how to solve this? Thank you very much in advance!


You have this preprocess function, right?

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(examples["review_title"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    model_inputs["labels_mask"] = labels["attention_mask"]
    return model_inputs

All you have to do is remove this line:
model_inputs["labels_mask"] = labels["attention_mask"]

So you end up with:

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    labels = tokenizer(examples["review_title"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

That is indeed the fix, but can we get an explanation of why this is the issue? There are two attention masks: one from running the tokenizer on examples["review_body"], and one from running it on examples["review_title"]. The first one is not a problem, but the second one is, and I cannot understand why.
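My understanding (worth double-checking against the transformers source): `DataCollatorForSeq2Seq` pads `labels` itself (with `-100`), then hands everything to `tokenizer.pad`, which pads only the keys it recognizes, such as `input_ids` and `attention_mask`. An extra key like `labels_mask` is passed through untouched, so its rows stay ragged and the final tensor conversion blows up on exactly that key. The first mask is fine because `attention_mask` is a key the padder knows. Here is a minimal stand-in sketch (not the actual transformers code) of that behavior:

```python
# Stand-in sketch of DataCollatorForSeq2Seq + tokenizer.pad (NOT the real source):
# only known keys get padded; unknown keys survive ragged and fail tensor conversion.

def pad_batch(features, pad_id=0, label_pad_id=-100):
    """Pad a list of feature dicts the way the collator pipeline does."""
    known = {
        "input_ids": pad_id,
        "attention_mask": 0,
        "labels": label_pad_id,  # the collator pads labels itself, with -100
    }
    batch = {}
    for key in features[0]:
        rows = [list(f[key]) for f in features]
        if key in known:  # pad only the keys the pipeline knows about
            width = max(len(r) for r in rows)
            rows = [r + [known[key]] * (width - len(r)) for r in rows]
        batch[key] = rows
    # "Tensor conversion": every key must now be rectangular.
    for key, rows in batch.items():
        if len({len(r) for r in rows}) > 1:
            raise ValueError(
                f"Unable to create tensor ... ({key} in this case) is ragged"
            )
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [1, 2], "labels_mask": [1, 1]},
    {"input_ids": [5, 6], "labels": [1, 2, 3], "labels_mask": [1, 1, 1]},
]

try:
    pad_batch(features)
except ValueError as e:
    print(e)  # labels_mask is the only key left ragged

del features[0]["labels_mask"], features[1]["labels_mask"]
batch = pad_batch(features)
print(batch["labels"])  # [[1, 2, -100], [1, 2, 3]]
```

That is also why dropping `labels_mask` loses nothing: label positions padded with `-100` are already ignored by the loss, so a separate mask for the titles is redundant.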
