Hi!
I am trying to include some of my own vocabulary as special tokens in RobertaTokenizer, but I have noticed that it does not mask them properly for the MLM objective:
from transformers import RobertaTokenizer
t = RobertaTokenizer.from_pretrained(args.tokenizer_path, additional_special_tokens=["[SPECIAL_TOK]"])
t.all_special_ids
→ [0, 2, 3, 2, 1, 0, 4, 32000]
t("A test [SPECIAL_TOK] now", return_special_tokens_mask=True)
→ {'input_ids': [0, 107, 320, 32000, 37, 2], 'special_tokens_mask': [1, 0, 0, 0, 0, 1], 'attention_mask': [1, 1, 1, 1, 1, 1]}
I expect 'special_tokens_mask' to be [1, 0, 0, 1, 0, 1]. Do I just need to override the data collator I use with RobertaForMaskedLM so that it also masks out my custom special tokens (rough sketch of what I mean below)? Or why is this happening? For context, I trained a custom BPE tokenizer with this module:
from tokenizers.implementations import ByteLevelBPETokenizer
and I set special_tokens in there so that my custom tokens stay atomic. I also do not want these tokens to be masked/predicted when training my LM.
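For completeness, the training looked roughly like this (the paths, vocab size, and exact special-token list are placeholders from memory, not my real values):

from tokenizers.implementations import ByteLevelBPETokenizer
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["corpus.txt"],   # placeholder path to my training data
    vocab_size=32000,       # placeholder
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "[SPECIAL_TOK]"],  # so [SPECIAL_TOK] is never split
)
bpe.save_model(args.tokenizer_path)  # later loaded via RobertaTokenizer.from_pretrained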
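And this is the behaviour I am after, i.e. the mask I would build by hand if I end up overriding the collator (just a sketch, not tested):

special_ids = set(t.all_special_ids)  # contains 32000 after adding [SPECIAL_TOK]
enc = t("A test [SPECIAL_TOK] now")
wanted_mask = [1 if tok_id in special_ids else 0 for tok_id in enc["input_ids"]]
# wanted_mask → [1, 0, 0, 1, 0, 1], so [SPECIAL_TOK] would never be selected for MLM masking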