Hello all,
Recently I’ve been working on training a RoBERTa model from scratch, starting from the code in this tutorial.
I am working with a specific corpus that I prepared according to my own format:
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
and I noticed that one of the parameters that can be passed to the tokenizer.encode_plus function is add_special_tokens.
If add_special_tokens=True, the result of encoding the sentence
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
becomes
<s> <s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s> </s>
and the special_tokens_mask is 1 0 0 … 0 1.
When I tried add_special_tokens=False on the same sentence
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
the result of the encoding was correct:
<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>
However, the special_tokens_mask remained 0 0 … 0 0.
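For reference, this is a minimal sketch of how I call the tokenizer to reproduce both behaviours (the checkpoint name is just a placeholder for the tokenizer I trained on my own corpus):

```python
from transformers import RobertaTokenizerFast

# Placeholder checkpoint: in my case this is the tokenizer trained on my corpus
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

text = "<s> ID 10 <i> COUNTRY USA <i> CAPITAL Washington DC </s>"

for add_special in (True, False):
    enc = tokenizer.encode_plus(
        text,
        add_special_tokens=add_special,
        return_special_tokens_mask=True,
    )
    print(f"add_special_tokens={add_special}")
    print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # the encoded tokens
    print(enc["special_tokens_mask"])                         # the mask I am asking about
```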
After testing both versions, the results I got from the first were very good, while the second failed.
This raises a few issues that I wasn’t able to solve on my own:
- How can I access the special_tokens_mask to correct it to what it should be?
- Where does RoBERTa make use of that mask, if it does?
- Is there a method for setting the mask to something I want? E.g., the mask for
<s> ID 10 <i> COUNTRY USA </s>
should be 1 0 0 1 0 0 1 if <s>, </s> and <i> should all be treated as special tokens (see the sketch after this list).
- If RoBERTa is not the correct model to do this, what model should I go for?
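To make the third question concrete, here is a small sketch of the mask I would like to end up with (the function name and the set of special tokens are my own, only to illustrate the expected output):

```python
# Tokens I would like to treat as special in my corpus (my own assumption)
SPECIAL_TOKENS = {"<s>", "</s>", "<i>"}

def desired_special_tokens_mask(tokens):
    # 1 for tokens I consider special, 0 for everything else
    return [1 if tok in SPECIAL_TOKENS else 0 for tok in tokens]

tokens = "<s> ID 10 <i> COUNTRY USA </s>".split()
print(desired_special_tokens_mask(tokens))  # -> [1, 0, 0, 1, 0, 0, 1]
```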
Thanks a lot!