I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the
SentencePieceUnigramTokenizer class. I train this tokenizer on a dataset that contains no non-ASCII unicode characters, and then try to encode a string that does contain such characters.
My understanding is that SentencePiece is lossless and reversible, and therefore it should always encode out-of-vocabulary input such that it can be decoded back to the same string, just like the
ByteLevelBPETokenizer does. So, theoretically, SentencePiece shouldn't even need
<unk> as a special token. However, the HuggingFace implementation does have a parameter to specify the unknown token as a special token, and it always encodes unseen unicode characters in the input string as <unk>.
My questions are:
- Is this expected with SentencePiece in general, and is its claim of being lossless therefore not really true?
- Is this specific to the HuggingFace implementation (but not to Google's)?
- Is there any way to make the HuggingFace implementation perfectly lossless, just like ByteLevelBPETokenizer?