Why does RoBERTa behave differently if I provide a corpus that contains special tokens?

Hello all,
Recently I’ve been working on training a RoBERTa model from scratch, starting from the code in this tutorial.

I am working with a specific corpus that I prepared in my own format:
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
and I noticed that one of the parameters that can be passed to the tokenizer.encode_plus function is add_special_tokens.

If add_special_tokens=True, the result of the encoding of the sentence
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
becomes
<s> <s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s> </s>
and the special_tokens_mask is 1 0 0 … 0 1

When I tried add_special_tokens=False on the same sentence
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
the result of the encoding was correct:
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
However, the special_tokens_mask was all zeros: 0 0 … 0 0.
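
For reference, here is a minimal sketch of the two calls (the class and path are illustrative; I’m assuming a RobertaTokenizerFast trained on this corpus):

```python
from transformers import RobertaTokenizerFast

# Illustrative path: assumes a tokenizer trained on the custom corpus.
tokenizer = RobertaTokenizerFast.from_pretrained("./my-roberta-tokenizer")

text = "<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>"

# Case 1: an extra <s> ... </s> pair is wrapped around the text, and
# only that added pair is flagged in special_tokens_mask (1 0 ... 0 1).
enc_true = tokenizer.encode_plus(
    text, add_special_tokens=True, return_special_tokens_mask=True
)
print(enc_true["input_ids"])
print(enc_true["special_tokens_mask"])

# Case 2: nothing is added, but the mask comes back all zeros, even
# for the <s> and </s> already present in the text.
enc_false = tokenizer.encode_plus(
    text, add_special_tokens=False, return_special_tokens_mask=True
)
print(enc_false["input_ids"])
print(enc_false["special_tokens_mask"])
```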

After training with both versions, the first gave very good results, while the second failed.

This raises a few issues that I wasn’t able to solve on my own:

  • How can I access the special_tokens_mask to correct it to what it should be?
  • Where does RoBERTa make use of that mask, if it does?
  • Is there a method for setting the mask to something I want? e.g. the mask for <s> ID 10 <i> COUNTRY USA </s> should be 1 0 0 1 0 0 1 if <s>, </s> and <i> should all be treated as special tokens (a possible approach is sketched after this list).
  • If RoBERTa is not the correct model to do this, what model should I go for?
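
For the first and third bullets, one possible workaround (just a sketch, not a confirmed fix) is to register <i> as an additional special token and recompute the mask from the token IDs yourself:

```python
# Sketch: `tokenizer` and `text` are the ones from the snippet above.
# Register <i> so it is treated as special alongside <s> and </s>.
# (If a model is already instantiated, remember to call
# model.resize_token_embeddings(len(tokenizer)) afterwards.)
tokenizer.add_special_tokens({"additional_special_tokens": ["<i>"]})

input_ids = tokenizer.encode(text, add_special_tokens=False)

# Option A: let the tokenizer flag every ID found in all_special_ids.
mask = tokenizer.get_special_tokens_mask(
    input_ids, already_has_special_tokens=True
)

# Option B: build the same mask by hand.
special_ids = set(tokenizer.all_special_ids)
mask = [1 if i in special_ids else 0 for i in input_ids]
```

The recomputed mask could then be swapped into the encoding dict before it reaches, say, the masked-LM data collator.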

Thanks a lot!

This actually feels more like a bug than a problem on your end. I suspect that the tokenization is identical between the two (i.e. <s> gets the same ID in both cases), in which case the special_tokens_mask should also be the same. Best to wait for others who might be more certain.
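
One quick way to check that (a sketch, reusing the tokenizer and text from the snippet in the original post):

```python
# With identical inner tokenization, the True variant should just be
# the False variant wrapped in one extra <s>/</s> pair.
ids_true = tokenizer.encode(text, add_special_tokens=True)
ids_false = tokenizer.encode(text, add_special_tokens=False)
assert ids_true[1:-1] == ids_false
assert ids_true[0] == tokenizer.bos_token_id
assert ids_true[-1] == tokenizer.eos_token_id
```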

Hello! Thanks for the reply. I checked what you mentioned and yes, you were right.
I encoded the sentence below with add_special_tokens set first to False and then to True.
Token 4 is <s>, which serves as both the cls_token and the bos_token, while token 6 is </s>, which serves as both the sep_token and the eos_token.

The first case uses add_special_tokens=False and its special_tokens_mask is all 0s; the second case uses add_special_tokens=True and, as expected, the <bos> and <eos> tokens were added by the tokenizer. Its special_tokens_mask only marks the first and last tokens as 1, while all the other 4s and 6s are left at 0.

<s> ID 10 </s><s> NAME Trevor </s> <s> COUNTRY USA </s><s> CAPITAL Washington DC </s>

add_special_tokens=False (input_ids; the special_tokens_mask is all 0s):
[4, 0, 232, 28, 27, 6, 4, 1, 232, 63, 93, 80, 97, 90, 93, 6, 4, 2, 232, 64, 62, 44, 6, 4, 3, 232, 66, 76, 94, 83, 84, 89, 82, 95, 90, 89, 232, 47, 46, 6]

add_special_tokens=True (input_ids, then special_tokens_mask):
[4, 4, 0, 232, 28, 27, 6, 4, 1, 232, 63, 93, 80, 97, 90, 93, 6, 4, 2, 232, 64, 62, 44, 6, 4, 3, 232, 66, 76, 94, 83, 84, 89, 82, 95, 90, 89, 232, 47, 46, 6, 6]
[1, 0, 0,   0,  0,  0, 0, 0, 0,   0,  0,  0,  0,  0,  0,  0, 0, 0, 0,   0,  0,  0,  0, 0, 0, 0,   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   0,  0,  0, 0, 1]
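
To double-check the ID-to-token mapping (a sketch, using the same tokenizer as in the earlier snippets):

```python
# Confirm which tokens IDs 4 and 6 map to, and that the tokenizer
# itself considers them special.
print(tokenizer.convert_ids_to_tokens([4, 6]))   # ['<s>', '</s>']
print(tokenizer.cls_token, tokenizer.bos_token)  # <s> <s>
print(tokenizer.sep_token, tokenizer.eos_token)  # </s> </s>
print(tokenizer.all_special_ids)                 # should include 4 and 6
```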

I asked this question in an issue on the main repository, and received an in-depth answer there.

cc @mfuntowicz

This feels like a bug: special tokens do not seem to be flagged correctly in the RobertaTokenizer’s special_tokens_mask.