Different size of Roberta-base tokenizer and model embedding

tntchung · February 24, 2022, 4:02pm

Hi all,

A quick question about the sizes of the RoBERTa tokenizer and model.

I notice that the model_max_length of the ‘roberta-base’ tokenizer is 512, while max_position_embeddings of the roberta-base model is set to 514. May I know the reason behind this?

I think the bos and eos tokens have already been added in the map(preprocess_function, batched=True) step.

If both are set to the same value, an error message is raised (IndexError: index out of range in self).

Thanks.

Srnl · March 1, 2022, 1:22pm

See RoBERTa and 514 · Issue #1187 · pytorch/fairseq · GitHub
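
As far as I understand it, the two extra slots come from how position ids are numbered: non-padding tokens start at padding_idx + 1 rather than 0, and padding tokens keep padding_idx itself. With a pad token id of 1, a 512-token sequence therefore reaches position 513, which needs an embedding table of 514 rows. Here is a minimal pure-Python sketch of that arithmetic. The helper names are hypothetical, the token ids (0 for bos, 2 for eos, 1 for pad, 100 as a placeholder) are assumptions about roberta-base, and this is not the library's actual code:

```python
# Pure-Python sketch of RoBERTa-style position numbering (no external libraries).
# Assumption: the pad token id of roberta-base is 1.
PADDING_IDX = 1


def position_ids(input_ids, padding_idx=PADDING_IDX):
    """Number non-pad tokens from padding_idx + 1; pad tokens keep padding_idx."""
    ids = []
    count = 0  # running count of non-pad tokens seen so far
    for tok in input_ids:
        if tok == padding_idx:
            ids.append(padding_idx)
        else:
            count += 1
            ids.append(count + padding_idx)
    return ids


def required_position_embeddings(max_tokens, padding_idx=PADDING_IDX):
    """Smallest position-embedding table that can index max_tokens non-pad tokens."""
    return max_tokens + padding_idx + 1


# Hypothetical full-length input: <s> (0), 510 placeholder tokens, </s> (2).
full_sequence = [0] + [100] * 510 + [2]

print(max(position_ids(full_sequence)))   # largest position id that gets looked up
print(required_position_embeddings(512))  # rows the embedding table must have
```

Under these assumptions, setting max_position_embeddings to 512 would leave positions 512 and 513 without a row for a full-length input, which would explain the IndexError above.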
