Positional encoding error in RoBERTa

Hello all,

I’m using a RobertaForMaskedLM model initialized with the following configuration:

config = RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=128)

because I pad my input sequences to a length of 128.

However, during training, I get the following error message:

IndexError: index out of range in self

which is raised when the position_ids are looked up in the position embedding layer.

Indeed, the position embedding layer is initialized with the argument max_position_embeddings (equal to 128 here):

self.position_embeddings = nn.Embedding(config.max_position_embeddings, ...)

whereas the function that creates position_ids generates indices ranging from padding_idx up to padding_idx + sequence_length (here 1 + 128 = 129, which exceeds the largest valid index, 127).
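For reference, the position-id creation in modeling_roberta.py is roughly equivalent to the following sketch (paraphrased from the source, without the past-key-values handling):

import torch

def create_position_ids_from_input_ids(input_ids, padding_idx):
    # Non-padding tokens get positions padding_idx + 1, padding_idx + 2, ...
    # while padding tokens keep padding_idx itself.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
    return incremental_indices.long() + padding_idx

# With padding_idx = 1 and 128 non-padding tokens, the largest id is
# 128 + 1 = 129, but the embedding table only covers indices 0..127.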

Therefore, it seems that one needs to set max_position_embeddings = sequence_length + padding_idx + 1 (i.e. 128 + 1 + 1 = 130) to avoid errors, but I can’t see any explanation of this in the documentation. Can anyone explain what I haven’t understood?
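For now I avoid the error with a configuration along these lines (assuming the tokenizer’s pad_token_id is 1, RoBERTa’s default):

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=128 + 1 + 1,  # padded length + padding_idx + 1 = 130
)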

When you tokenize a text, a bos token and an eos token are added to the beginning and end of the tokenized output.

So if your text has >= 128 tokens and truncation is set to True, it will be truncated to 126 tokens so that there is room for the bos and eos tokens. The extra 2 comes from this.
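You can check this directly with the tokenizer; the snippet below assumes roberta-base, but any RoBERTa tokenizer behaves the same way:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

text = "word " * 200  # clearly longer than 128 tokens once tokenized
enc = tokenizer(text, truncation=True, max_length=128)

print(len(enc["input_ids"]))                           # 128
print(enc["input_ids"][0] == tokenizer.bos_token_id)   # True  -> <s>
print(enc["input_ids"][-1] == tokenizer.eos_token_id)  # True  -> </s>
# so only 126 ids come from the text itself; <s> and </s> account for the other 2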