SentencePiece tokenizer encodes to unknown token

I am using the Hugging Face implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset that contains no non-ASCII Unicode characters, then try to encode a string that does contain such characters.

My understanding is that SentencePiece is lossless and reversible, so it should always encode out-of-vocabulary tokens in a way that can be decoded back to the same string, just like ByteLevelBPETokenizer. In theory, SentencePiece shouldn't even need `<unk>` as a special token. However, the Hugging Face implementation does have a parameter for specifying an unknown token, and it always encodes unseen Unicode characters in the input string as `<unk>`.
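For reference, here is a minimal sketch of what I am doing with the `tokenizers` library (the training corpus and vocabulary size are just illustrative placeholders):

```python
from tokenizers import SentencePieceBPETokenizer

# Train on an ASCII-only corpus (illustrative; my real dataset is larger)
tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    ["hello world", "this corpus has no unicode characters"],
    vocab_size=500,
    special_tokens=["<unk>"],
)

# Encode a string containing a character never seen during training
encoded = tokenizer.encode("hello wörld")
print(encoded.tokens)  # the unseen "ö" shows up as <unk>

# Decoding does not recover the original string: the "ö" is lost
print(tokenizer.decode(encoded.ids))
```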

My questions are:

  1. Is this expected behavior with SentencePiece in general, meaning its claim of being lossless is not really true?
  2. Is this specific to the Hugging Face implementation (and not Google's)?
  3. Is there any way to make the Hugging Face implementation perfectly lossless, just like ByteLevelBPETokenizer?
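For comparison, this is the ByteLevelBPETokenizer behavior I would like to reproduce: because it falls back to bytes, it round-trips characters that never appeared in training (again, the tiny corpus is just a placeholder):

```python
from tokenizers import ByteLevelBPETokenizer

# Train on the same ASCII-only toy corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    ["hello world", "this corpus has no unicode characters"],
    vocab_size=300,
)

# "ö" was never seen, but byte fallback encodes it without <unk>
encoded = tokenizer.encode("hello wörld")
decoded = tokenizer.decode(encoded.ids)
print(decoded)  # the original string is recovered exactly
```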
