Hello everyone!
I'm trying to execute the code from the summarization task in Chapter 7 of the Hugging Face Course. I'm running it in a Jupyter notebook. But when I execute these lines:
```python
features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)
```
I get this error:
```
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

ValueError                                Traceback (most recent call last)
/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    715                 if not is_tensor(value):
--> 716                     tensor = as_tensor(value)
    717

ValueError: expected sequence of length 7 at dim 1 (got 6)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_615/1644402533.py in <module>
      1 features = [tokenized_datasets["train"][i] for i in range(2)]
----> 2 data_collator(features)

/opt/conda/lib/python3.9/site-packages/transformers/data/data_collator.py in __call__(self, features, return_tensors)
    584                     feature["labels"] = np.concatenate([remainder, feature["labels"]]).astype(np.int64)
    585
--> 586         features = self.tokenizer.pad(
    587             features,
    588             padding=self.padding,

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
   2979             batch_outputs[key].append(value)
   2980
-> 2981         return BatchEncoding(batch_outputs, tensor_type=return_tensors)
   2982
   2983     def create_token_type_ids_from_sequences(

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    204         self._n_sequences = n_sequences
    205
--> 206         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    207
    208     @property

/opt/conda/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    730                     "Please see if a fast version of this tokenizer is available to have this feature available."
    731                 )
--> 732                 raise ValueError(
    733                     "Unable to create tensor, you should probably activate truncation and/or padding with"
    734                     " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels_mask` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
```
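For context, my setup follows the course as far as I can tell. It looks roughly like this (the checkpoint name is the one from the chapter, so treat it as my assumption rather than a verified copy of my notebook):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

# Checkpoint used in the course chapter (assumption: mine is the same)
model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Pads input_ids/attention_mask to the longest sequence in the batch and
# pads the labels with -100 so they are ignored by the loss
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
```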
Unfortunately, I do not know where to look further for a solution. Does anybody know how to solve this? Thank you very much in advance!
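Edit: in case it helps with diagnosing, here is a small check I can run to see which keys each feature actually contains. As I understand it, the collator can only pad lists of token IDs, so a leftover column (for example a raw-text field, or the `labels_mask` the error mentions) could explain the ValueError. The `remove_columns` call is the one from the course notebook; `books_dataset` and the column names are my assumption:

```python
# Inspect the keys and value types of the features passed to the collator;
# anything that is not a list of ints cannot be padded into a tensor
features = [tokenized_datasets["train"][i] for i in range(2)]
for feature in features:
    print({key: type(value) for key, value in feature.items()})

# The course drops the original (string) columns before using the collator;
# "books_dataset" is the course's variable name for the raw dataset
tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)
```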