SentencePiece tokenizer encodes to unknown token

sytelus · August 2, 2023, 9:30am

I am using the HuggingFace implementation of the SentencePiece tokenizer, i.e., the SentencePieceBPETokenizer and SentencePieceUnigramTokenizer classes. I train these tokenizers on a dataset that contains no Unicode (non-ASCII) characters and then try to encode a string that does contain them.

My understanding is that SentencePiece is lossless and reversible, so it should always encode out-of-vocabulary input in a way that decodes back to the same string, just like ByteLevelBPETokenizer. In theory, then, SentencePiece shouldn’t even need <unk> as a special token. However, the HuggingFace implementation does have a parameter for specifying the unknown token as a special token, and it always encodes unseen Unicode characters in the input string as <unk>.
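To make the distinction concrete, here is a toy sketch in plain Python of the two behaviours I am contrasting. It does not use the tokenizers or sentencepiece libraries. The vocabulary, the helper names, and the `<0xNN>` byte-token format are all assumptions made up for illustration.

```python
# Toy illustration only: this is NOT the tokenizers or sentencepiece API.
# VOCAB and every helper below are hypothetical names invented for this post.
VOCAB = set("abcdefghijklmnopqrstuvwxyz ")  # pretend the training data was ASCII-only
UNK = "<unk>"


def encode_with_unk(text):
    """Map every character outside the vocabulary to <unk> (lossy)."""
    return [ch if ch in VOCAB else UNK for ch in text]


def encode_with_byte_fallback(text):
    """Spell out unseen characters as UTF-8 byte tokens such as <0xC3> (lossless)."""
    tokens = []
    for ch in text:
        if ch in VOCAB:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode(tokens):
    """Rebuild the string; byte tokens are turned back into raw bytes first."""
    raw = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            raw.append(int(tok[3:5], 16))
        else:
            raw.extend(tok.encode("utf-8"))
    return raw.decode("utf-8")


text = "caf\u00e9"  # "café": the last character is not in VOCAB
lossy = encode_with_unk(text)
lossless = encode_with_byte_fallback(text)

print(lossy, "->", decode(lossy))        # the accented character is lost
print(lossless, "->", decode(lossless))  # the original string comes back
```

The first behaviour is what I observe; the second is roughly what I expected from a tokenizer that is described as lossless.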

My questions are:

1. Is this expected of SentencePiece in general, meaning its claim of being lossless is not really true?
2. Is this specific to the HuggingFace implementation (and not Google’s)?
3. Is there any way to make the HuggingFace implementation perfectly lossless, just like ByteLevelBPETokenizer?

Thanks.
