ERROR? Why does encoding [MASK] before '.' produce an extra id 13?

I found that if I use the ALBERT tokenizer and encode a sentence with [MASK] before '.', I get an additional id 13,
like this:
tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
the ‘input_ids’ would be:
tensor([[ 2, 14, 1057, 16, 714, 25, 4, 13, 9, 3]])
4 refers to [MASK] and 9 refers to '.', but 13 appears to refer to nothing
This happens whenever [MASK] comes directly before '.'.
Is there something wrong?

You don't generally mask words by replacing the text of the word with "[MASK]". You usually encode the text first, "The capital of France is Paris.", and then replace the token for Paris with the mask token. Perhaps that's the reason for the addition of token 13?
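
A minimal sketch of that encode-then-replace approach; the position index 6 and mask id 4 are taken from the outputs shown elsewhere in this thread:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer.encode("The capital of France is Paris.")
print(inputs)
# [2, 14, 1057, 16, 714, 25, 1162, 9, 3]  ('▁paris' is id 1162, at position 6)
inputs[6] = tokenizer.mask_token_id  # 4 for albert-base-v2
print(inputs)
# [2, 14, 1057, 16, 714, 25, 4, 9, 3]  (no stray 13 this way)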

I am not sure what the cause of this is. In SentencePiece you typically have a special underline character '▁' before a word-initial subword token. In this case, id 13 in the vocab is apparently the bare '▁' token. I am not sure why it is needed in this case, though. For those who want to try things out, here is an MCVE:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer.encode("The capital of France is [MASK].")
print(inputs)
# [2, 14, 1057, 16, 714, 25, 4, 13, 9, 3]
print(tokenizer.convert_ids_to_tokens(inputs))
# ['[CLS]', '▁the', '▁capital', '▁of', '▁france', '▁is', '[MASK]', '▁', '.', '[SEP]']
print(tokenizer.decode(inputs))
# [CLS] the capital of france is[MASK].[SEP]
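
A quick way to confirm what id 13 maps to, using the same checkpoint as above:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
# Look up id 13 directly in the vocab
print(tokenizer.convert_ids_to_tokens([13]))
# ['▁']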

Perhaps @lysandre (ALBERT) or @mfuntowicz (tokenizers) have an idea.

If anything, I'd expect the '▁' token to appear before rather than after '[MASK]', i.e.:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
inputs = tokenizer.encode("The capital of France is Paris.")
print(inputs)
# [2, 14, 1057, 16, 714, 25, 1162, 9, 3]
print(tokenizer.convert_ids_to_tokens(inputs))
# ['[CLS]', '▁the', '▁capital', '▁of', '▁france', '▁is', '▁paris', '.', '[SEP]']
print(tokenizer.decode(inputs))
# [CLS] the capital of france is paris.[SEP]

I suspect it's something to do with the fact that [MASK] is a special token; all other word-initial tokens start with '▁'. Perhaps there's a bug where ' [MASK]' maps to '[MASK]', '▁' rather than '▁', '[MASK]'?

Or perhaps it's just not supported (tokenising text which actually contains literal special tokens)?
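
The ordering is easy to check directly; assuming the same checkpoint, tokenizing the original sentence shows the bare '▁' landing after [MASK], not before it:

from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
print(tokenizer.tokenize("The capital of France is [MASK]."))
# ['▁the', '▁capital', '▁of', '▁france', '▁is', '[MASK]', '▁', '.']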

That would be a solution.
But then what about the tutorial at https://huggingface.co/transformers/model_doc/albert.html, which shows [MASK] being placed directly in the text?

This behavior is also present in other models, e.g. in xlm-roberta-base, where a token with id 6 is injected. That token is the empty string '', which can be quite confusing. This bug should be addressed, or at least it should be documented that placing mask tokens directly in the text is not recommended.
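
For reference, the same kind of check should reproduce this; the id-6 and empty-string details are taken from the report above, not verified independently here:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
inputs = tokenizer.encode("The capital of France is <mask>.")
print(inputs)
print(tokenizer.convert_ids_to_tokens(inputs))
# As reported above, an extra id 6 is injected next to <mask>,
# and it renders as the empty string ''.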