Different sizes of roberta-base tokenizer and model position embedding

Hi all,

One quick question about the size of the roberta-base tokenizer and model.

I notice that the model_max_length of the ‘roberta-base’ tokenizer is 512, while the max_position_embeddings of the roberta-base model is set to 514. May I know the reason behind this?

I think the bos and eos tokens have already been added in map(preprocess_function, batched=True).

If both are set to the same value, I get an error message (IndexError: index out of range in self).


See "RoBERTa and 514" (Issue #1187, pytorch/fairseq on GitHub).
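
To make the arithmetic concrete, here is a minimal pure-Python sketch (not the library's actual code). It assumes, as I understand the behaviour discussed in that issue, that RoBERTa's pad token id is 1, that real tokens get position ids counting up from padding_idx + 1, and that pad tokens keep position padding_idx. The helper name roberta_position_ids and the dummy token id 100 are made up for illustration.

```python
PADDING_IDX = 1  # assumption: RoBERTa's <pad> token id


def roberta_position_ids(input_ids, padding_idx=PADDING_IDX):
    """Hypothetical helper that mimics RoBERTa-style position numbering.

    Non-pad tokens are numbered padding_idx + 1, padding_idx + 2, ...
    Pad tokens all keep the position id padding_idx.
    """
    position_ids = []
    seen = 0  # number of non-pad tokens seen so far
    for token_id in input_ids:
        if token_id != padding_idx:
            seen += 1
            position_ids.append(seen + padding_idx)
        else:
            position_ids.append(padding_idx)
    return position_ids


# A full-length input: <s> (id 0), 510 dummy tokens, </s> (id 2) = 512 tokens.
max_tokens = 512
input_ids = [0] + [100] * (max_tokens - 2) + [2]

position_ids = roberta_position_ids(input_ids)

# The embedding table must have a row for every index up to the largest one.
required_rows = max(position_ids) + 1

print(len(input_ids), min(position_ids), max(position_ids), required_rows)
```

Under these assumptions, a 512-token input uses position ids 2 through 513, so the position embedding table needs 514 rows. If max_position_embeddings were lowered to 512, the ids 512 and 513 would fall outside the table, which would be consistent with the IndexError mentioned above.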

