Positional encoding error in RoBERTa

Hello all,

I’m using a RobertaForMaskedLM model initialized with the following configuration:

config = RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=128)

because I pad my input sequences to a length of 128.

However, during training, I get the following error message:

IndexError: index out of range in self

which is raised when the position_ids are looked up in the position embedding layer.

Indeed, the position embedding layer is initialized with the argument max_position_embeddings (equal to 128 here):

self.position_embeddings = nn.Embedding(config.max_position_embeddings, ...)

whereas the function that creates position_ids generates indices ranging from padding_idx up to padding_idx + sequence_length (here 1 + 128 = 129, which exceeds the largest valid index, 127).
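For reference, the position-id creation in modeling_roberta.py is roughly equivalent to the following sketch (paraphrased from the source, without the past-key-values handling):

import torch

def create_position_ids_from_input_ids(input_ids, padding_idx):
    # Non-padding tokens get positions padding_idx + 1, padding_idx + 2, ...
    # while padding tokens keep padding_idx itself.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
    return incremental_indices.long() + padding_idx

# With padding_idx = 1 and 128 non-padding tokens, the largest id is
# 128 + 1 = 129, but the embedding table only covers indices 0..127.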

Therefore, it seems that one needs to set max_position_embeddings = sequence_length + padding_idx + 1 (i.e. 128 + 1 + 1 = 130) to avoid errors, but I can’t see any explanation of this in the documentation. Can anyone explain what I haven’t understood?
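For now I avoid the error with a configuration along these lines (assuming the tokenizer’s pad_token_id is 1, RoBERTa’s default):

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=128 + 1 + 1,  # padded length + padding_idx + 1 = 130
)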

When you tokenize a text, a bos token and an eos token are added to the beginning and end of the tokenized output.

So if your text has >= 128 tokens and truncation is set to True, it will be truncated to 126 tokens so that there is room for the bos and eos tokens. The extra 2 comes from this.
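You can check this directly with the tokenizer; the snippet below assumes roberta-base, but any RoBERTa tokenizer behaves the same way:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

text = "word " * 200  # clearly longer than 128 tokens once tokenized
enc = tokenizer(text, truncation=True, max_length=128)

print(len(enc["input_ids"]))                           # 128
print(enc["input_ids"][0] == tokenizer.bos_token_id)   # True  -> <s>
print(enc["input_ids"][-1] == tokenizer.eos_token_id)  # True  -> </s>
# so only 126 ids come from the text itself; <s> and </s> account for the other 2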