ERROR? Why does encoding [MASK] before '.' produce an extra id 13?

I found that if I use the ALBERT tokenizer and encode a sentence with [MASK] before '.', I get an additional id 13,
like this:
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
the ‘input_ids’ would be:
tensor([[ 2, 14, 1057, 16, 714, 25, 4, 13, 9, 3]])
4 refers to [MASK] and 9 refers to '.', but 13 appears to refer to nothing
This happens whenever [MASK] comes directly before '.'.
Is there something wrong?

You don't generally mask words by replacing the text of the word with "[MASK]". You usually encode the text first, "The capital of France is Paris.", and then replace the token for Paris with the mask token. Perhaps that's the reason for the addition of token 13?
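
A minimal sketch of that encode-then-replace approach; the position index 6 and mask id 4 are taken from the outputs shown elsewhere in this thread:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer.encode("The capital of France is Paris.")
print(inputs)
# [2, 14, 1057, 16, 714, 25, 1162, 9, 3]  ('▁paris' is id 1162, at position 6)
inputs[6] = tokenizer.mask_token_id  # 4 for albert-base-v2
print(inputs)
# [2, 14, 1057, 16, 714, 25, 4, 9, 3]  (no stray 13 this way)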

I am not sure what the cause of this is. In SentencePiece you typically have a special underline character '▁' before a word-initial subword token. In this case, id 13 in the vocab is apparently the bare '▁' token. I am not sure why it is needed in this case, though. For those who want to try things out, here is an MCVE:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer.encode("The capital of France is [MASK].")
print(inputs)
# [2, 14, 1057, 16, 714, 25, 4, 13, 9, 3]
print(tokenizer.convert_ids_to_tokens(inputs))
# ['[CLS]', '▁the', '▁capital', '▁of', '▁france', '▁is', '[MASK]', '▁', '.', '[SEP]']
print(tokenizer.decode(inputs))
# [CLS] the capital of france is[MASK].[SEP]
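
A quick way to confirm what id 13 maps to, using the same checkpoint as above:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
# Look up id 13 directly in the vocab
print(tokenizer.convert_ids_to_tokens([13]))
# ['▁']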

Perhaps @lysandre (ALBERT) or @mfuntowicz (tokenizers) have an idea.

If anything, I'd expect the '▁' token to appear before rather than after '[MASK]', i.e.:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer.encode("The capital of France is Paris.")
print(inputs)
# [2, 14, 1057, 16, 714, 25, 1162, 9, 3]
print(tokenizer.convert_ids_to_tokens(inputs))
# ['[CLS]', '▁the', '▁capital', '▁of', '▁france', '▁is', '▁paris', '.', '[SEP]']
print(tokenizer.decode(inputs))
# [CLS] the capital of france is paris.[SEP]

I suspect it's something to do with the fact that [MASK] is a special token; all other word-initial tokens start with '▁'. Perhaps there's a bug where ' [MASK]' maps to '[MASK]', '▁' rather than '▁', '[MASK]'?

Or perhaps it's just not supported (tokenising text which actually contains literal special tokens)?
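
The ordering is easy to check directly; assuming the same checkpoint, tokenizing the original sentence shows the bare '▁' landing after [MASK], not before it:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
print(tokenizer.tokenize("The capital of France is [MASK]."))
# ['▁the', '▁capital', '▁of', '▁france', '▁is', '[MASK]', '▁', '.']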

That would be a solution.
But then what about the tutorial at https://huggingface.co/transformers/model_doc/albert.html, which shows [MASK] being placed directly in the text?

This behavior is also present in other models, e.g. in xlm-roberta-base, where a token with id 6 is injected. That token is the empty string '', which can be quite confusing. This bug should be addressed, or at least it should be documented that placing mask tokens directly in the text is not recommended.
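
For reference, the same kind of check should reproduce this; the id-6 and empty-string details are taken from the report above, not verified independently here:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
inputs = tokenizer.encode("The capital of France is <mask>.")
print(inputs)
print(tokenizer.convert_ids_to_tokens(inputs))
# As reported above, an extra id 6 is injected next to <mask>,
# and it renders as the empty string ''.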