How is the encoding done for transformers? What encoder is used?

I’m running into an issue where the input text is longer than the maximum sequence length the model supports.

  • I checked the size of each text with len(text) and trimmed it with text[:510]. While debugging the error, though, I noticed that it is triggered not by the English text in the dataset but by text in other languages.
  • The Mandarin items seem to be the cause: somewhere during encoding in Transformers, they are expanded to more than 512.


  1. How do I address this issue at the data level?
  2. Which encoder does Transformers use, so that I can apply it during data pre-processing and remove items longer than 512?

Thank you.

This is the full error log:

  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/", line 111, in __next__
    item = next(self.iterator)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/", line 113, in __next__
    processed = self.infer(item, **self.params)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/", line 943, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/pipelines/", line 137, in _forward
    return self.model(**model_inputs)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/torch/nn/modules/", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/models/bert/", line 1554, in forward
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/torch/nn/modules/", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/models/bert/", line 994, in forward
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/torch/nn/modules/", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/DataDrive/Experiments/python_environments/lib/python3.6/site-packages/transformers/models/bert/", line 220, in forward
    embeddings += position_embeddings
RuntimeError: The size of tensor a (629) must match the size of tensor b (512) at non-singleton dimension 1

Hi @cyrilw,

Which model do you use to tokenize your data? Can you share the code that reproduces the error?

According to the documentation (Handling multiple sequences - Hugging Face Course), when a sequence is longer than the limit of the transformer model, one solution is to truncate your sentences, as you said.
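One caveat worth making concrete: the model's limit counts tokens, not characters, so a character slice such as text[:510] does not by itself guarantee a short enough input. The sketch below is a hypothetical helper (truncate_token_ids is not a transformers function) that trims an already-encoded single sequence to the limit, assuming the BERT layout [CLS] ... [SEP]. The ids 101 and 102 are used purely for illustration.

```python
from typing import List, Sequence


def truncate_token_ids(ids: Sequence[int], max_tokens: int = 512) -> List[int]:
    """Trim an encoded sequence to at most max_tokens ids.

    Assumption: ids has the single-sequence layout [CLS] ... [SEP], so the
    last id is a special token that must survive the cut.
    """
    if max_tokens < 2:
        raise ValueError("max_tokens must leave room for both special tokens")
    if len(ids) <= max_tokens:
        return list(ids)
    # Keep [CLS] plus the first (max_tokens - 2) content ids, then re-append [SEP].
    return list(ids[: max_tokens - 1]) + [ids[-1]]


# Illustrative input: 627 content ids plus two special tokens = 629 ids,
# the same length reported in the traceback above.
example_ids = [101] + list(range(1000, 1627)) + [102]
trimmed = truncate_token_ids(example_ids, max_tokens=512)
print(len(example_ids), len(trimmed))
```

In practice the Hugging Face tokenizers can do this for you when called with truncation=True and max_length=512; the helper only illustrates what that amounts to for a single sequence.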

I don’t know what your task is, but here is an example of training a causal language model (Training a causal language model from scratch - Hugging Face Course); its section Preparing the dataset addresses the problem of working with large contexts.

Hi Ramón @rwheel,

I am trying to run a classification task using unitary/toxic-bert · Hugging Face. This script shows how the data is currently preprocessed and run through prediction: toxic_bert classification · GitHub

I didn’t do any tokenization, since the toxic-bert examples just send the text straight to the model. I will look at those resources as well, thanks.

@rwheel, I looked at the documentation, and it seems truncation has its issues: the prediction/classification still breaks even when the input passes the truncation checks. Do you have any idea about the encoding scheme, or how an input is converted to vector form?
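To illustrate why a length check on characters can pass while the encoded input is still too long, here is a minimal sketch that filters texts by encoded length instead. filter_by_token_count and toy_encode are hypothetical names. toy_encode is only a stand-in that emits one id per UTF-8 byte plus two special tokens; it is not the WordPiece scheme BERT uses. With a real model, one would presumably pass tokenizer.encode from AutoTokenizer.from_pretrained("unitary/toxic-bert") in its place.

```python
from typing import Callable, List, Sequence, Tuple


def filter_by_token_count(
    texts: Sequence[str],
    encode: Callable[[str], List[int]],
    max_tokens: int = 512,
) -> Tuple[List[str], List[str]]:
    """Split texts into (kept, too_long) based on their encoded length."""
    kept, too_long = [], []
    for text in texts:
        if len(encode(text)) <= max_tokens:
            kept.append(text)
        else:
            too_long.append(text)
    return kept, too_long


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per UTF-8 byte, wrapped in
    two illustrative special-token ids. Not BERT's real tokenization."""
    return [101] + list(text.encode("utf-8")) + [102]


english = "a" * 510   # 510 characters, 510 bytes
mandarin = "毒" * 510  # 510 characters, but 3 bytes each in UTF-8

kept, too_long = filter_by_token_count([english, mandarin], toy_encode)
print(len(english), len(toy_encode(english)))
print(len(mandarin), len(toy_encode(mandarin)))
```

Both strings pass a len(text) <= 510 check, yet only the first fits once encoded. The exact counts depend on the tokenizer, which is why the check belongs after encoding rather than before.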

I’ve just found a thread that may be of interest to you.

Two additional resources also come to mind: the paper Attention Is All You Need and the book Natural Language Processing with Transformers, in which you can find good diagrams that explain the encoder. There is a GitHub repository for that book. Although not all chapters have been released, you can see the images and some code (