I’m running into an issue where the input text is longer than the maximum sequence length supported by the model’s embedding layer.
- I’ve trimmed down my data by using `len(text)` to measure the size and then truncating with `text[:510]`. But while debugging this error, I noticed it is not triggered by the English text in the dataset, but by other languages.
- The text items in Mandarin seem to cause this: somewhere during the encoding in Transformers, the input is expanded to more than 512 tokens (see the sketch below).
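To illustrate what I mean, here is a minimal sketch comparing the character count I trimmed on with the token count the model actually sees (the checkpoint name is a placeholder, not necessarily the one my pipeline uses):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; my pipeline may use a different BERT variant.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Example Mandarin text, repeated to make it long.
text = "这是一段用来测试的中文文本。" * 60

trimmed = text[:510]  # the character-level trim I applied

print(len(trimmed))                    # at most 510 *characters*
print(len(tokenizer.encode(trimmed)))  # *token* count, including special
                                       # tokens; this is what hits the 512 limit
```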
Question:
- How do I address this issue at the data level?
- OR: which tokenizer does Transformers use here, so that I can apply it at the data pre-processing stage to truncate or remove items longer than 512 tokens? (A rough sketch of what I have in mind follows this list.)
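For reference, this is the kind of token-level pre-processing I imagine might be needed, assuming I can load the same tokenizer as the pipeline’s model (the checkpoint name below is again a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; it would have to match the pipeline's model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def truncate_to_token_limit(text, max_length=512):
    # Truncate on *tokens* rather than characters, then decode back to
    # a plain string so it can be stored in the dataset.
    ids = tokenizer.encode(text, truncation=True, max_length=max_length)
    return tokenizer.decode(ids, skip_special_tokens=True)

def fits_model(text, max_length=512):
    # Alternative: drop items instead of truncating them.
    return len(tokenizer.encode(text)) <= max_length
```

I’m not sure whether decoding the truncated IDs back to text is lossless for every tokenizer, which is part of why I’m asking.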
Thank you.
This is the full error log:
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/pt_utils.py", line 111, in __next__
item = next(self.iterator)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/pt_utils.py", line 113, in __next__
processed = self.infer(item, **self.params)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/base.py", line 943, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/text_classification.py", line 137, in _forward
return self.model(**model_inputs)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 1554, in forward
return_dict=return_dict,
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 994, in forward
past_key_values_length=past_key_values_length,
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py", line 220, in forward
embeddings += position_embeddings
RuntimeError: The size of tensor a (629) must match the size of tensor b (512) at non-singleton dimension 1