Hi!
I am trying to include some of my own vocabulary as special tokens in RobertaTokenizer, but I have noticed that it does not mask them properly for the MLM objective:
from transformers import RobertaTokenizer
t = RobertaTokenizer.from_pretrained(args.tokenizer_path, additional_special_tokens=["[SPECIAL_TOK]"])
t.all_special_ids
→ [0, 2, 3, 2, 1, 0, 4, 32000]
t("A test [SPECIAL_TOK] now", return_special_tokens_mask=True)
→ {'input_ids': [0, 107, 320, 32000, 37, 2], 'special_tokens_mask': [1, 0, 0, 0, 0, 1], 'attention_mask': [1, 1, 1, 1, 1, 1]}
I expect 'special_tokens_mask' to be [1, 0, 0, 1, 0, 1]. Do I just need to override the data collator I use with RobertaForMaskedLM so that it also masks out my custom special tokens (rough sketch of what I mean below)? Or why is this happening? For context, I trained a custom BPE tokenizer with this module:
from tokenizers.implementations import ByteLevelBPETokenizer
and I set special_tokens in there so that my custom tokens stay atomic. I also do not want these tokens to be masked/predicted when training my LM.
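For completeness, the training looked roughly like this (the paths, vocab size, and exact special-token list are placeholders from memory, not my real values):

from tokenizers.implementations import ByteLevelBPETokenizer
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["corpus.txt"],   # placeholder path to my training data
    vocab_size=32000,       # placeholder
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "[SPECIAL_TOK]"],  # so [SPECIAL_TOK] is never split
)
bpe.save_model(args.tokenizer_path)  # later loaded via RobertaTokenizer.from_pretrained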
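And this is the behaviour I am after, i.e. the mask I would build by hand if I end up overriding the collator (just a sketch, not tested):

special_ids = set(t.all_special_ids)  # contains 32000 after adding [SPECIAL_TOK]
enc = t("A test [SPECIAL_TOK] now")
wanted_mask = [1 if tok_id in special_ids else 0 for tok_id in enc["input_ids"]]
# wanted_mask → [1, 0, 0, 1, 0, 1], so [SPECIAL_TOK] would never be selected for MLM masking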