Good day friends, I'm new to this forum. I am working on a transformer project for language translation, but I do have some concerns regarding the tokenizer I am using.
I am using the SentencePiece library to train a tokenizer from scratch on my corpus (a low-resource language), but I notice that special tokens are encoded with an extra token in front of them.
My special tokens were defined and assigned unique IDs: padding (<pad>) = 0, UNK (<unk>) = 1, BOS (<s>) = 2, and EOS (</s>) = 3.
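For reference, my training call looks roughly like this (the corpus path, model prefix, vocab size and the <sep> symbol are placeholders rather than my exact settings):

```python
import sentencepiece as spm

# Rough sketch of my training setup (paths and vocab_size are placeholders)
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # my low-resource corpus
    model_prefix="spm_model",
    vocab_size=32000,
    pad_id=0,                      # <pad>
    unk_id=1,                      # <unk>
    bos_id=2,                      # <s>
    eos_id=3,                      # </s>
    user_defined_symbols="<sep>",  # placeholder for the extra symbols I add (mentioned further down)
)
```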
After training, I realised that encoding the <s> token returns [31704, 2], and decoding that gives the token back. Decoding just [2] also returns the same token, while decoding just [31704] returns an empty string.
This was the behaviour for all of my other special tokens as well: I get [31704, my_assigned_id], whereas I expected only [my_assigned_id] for my special tokens.
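To make it concrete, this is roughly what I run and what I observe (the ID 31704 is of course specific to my trained model):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

print(sp.encode("<s>", out_type=int))  # I get [31704, 2] instead of just [2]
print(sp.decode([31704, 2]))           # gives me "<s>" back
print(sp.decode([2]))                  # also gives "<s>"
print(sp.decode([31704]))              # gives an empty string
```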
I'd also like to mention that my <unk> token was actually decoded as a subword; the other special tokens would have been decoded as subwords as well, except that I added them to the list of user_defined_symbols (SentencePiece does not allow adding the UNK token to either control_symbols or user_defined_symbols).
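In case it helps, this is a sketch of how I have been inspecting the special pieces:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

# Check how each special token is represented in the trained model
for piece in ["<pad>", "<unk>", "<s>", "</s>"]:
    pid = sp.piece_to_id(piece)
    print(piece, pid, sp.is_control(pid))  # whether SentencePiece treats it as a control symbol

print(sp.id_to_piece(31704))  # the mystery piece that keeps showing up
```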
Now the main problem with this extra token in front of my special tokens is that it would cause the decoder input and output to be mismatched. For example, suppose I had:
["I like bread"]: my decoder input should be ["<s> I like bread"], encoded as [2, 56, 67, 100], and the target output ["I like bread </s>"], encoded as [56, 67, 100, 3], where 2 and 3 represent the <s> and </s> tokens respectively.
On the contrary, the encoded values are [31704, 2, 56, 67, 100] and [56, 67, 100, 31704, 3]. Now, at the step where the model is supposed to take in the <s> token with "I" as the target, it would take in 31704 and the target would be 56 for "I". In the second step it would take in 2, the actual <s> token in my opinion, and the target would be 67 for "like". This would continue until the model takes in "like" and has "" (empty string, token 31704) as the target, and finally takes in "bread" and has </s> as the final target.
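In code, I build the teacher-forcing pairs roughly like this (simplified; the IDs in the comments are the ones from my example above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

src = "I like bread"
decoder_input = sp.encode("<s> " + src, out_type=int)   # I get [31704, 2, 56, 67, 100]
target        = sp.encode(src + " </s>", out_type=int)  # I get [56, 67, 100, 31704, 3]

# Teacher forcing pairs them position by position, so everything is shifted by the stray 31704:
for inp, tgt in zip(decoder_input, target):
    print(inp, "->", tgt)  # 31704 -> 56, 2 -> 67, 56 -> 100, 67 -> 31704, 100 -> 3
```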
This mismatch caused by the 31704 token seems like a problem to me, and my solution was to filter out all 31704 tokens; however, I realised that the reconstructed (decoded) text may then sometimes lack proper spacing around the special tokens.
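My current workaround is roughly the following, and this is where the spacing problem shows up (again just a sketch):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

EXTRA_ID = 31704  # the stray piece that appears before my special tokens

ids = sp.encode("<s> I like bread </s>", out_type=int)
filtered = [i for i in ids if i != EXTRA_ID]  # drop every 31704
print(sp.decode(filtered))  # sometimes the spacing around the special tokens comes out wrong
```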
I'd also like to note that, depending on the parameter settings during SentencePiece training, the special token 31704 was decoded to "▁", yet it is not a hyphen in itself, nor does it seem to be the real space character, which is rendered with "▁" as well. So encoding <s> with out_type=str returns ["▁", "<s>"], or [31704, 2] as integers. It is the same for the other special tokens: just a "▁" before the actual token.
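This is a sketch of how I have been inspecting that character; printing the code point is just my way of checking that it is not a plain hyphen or a normal space:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

pieces = sp.encode("<s>", out_type=str)
print(pieces)                            # I get ['▁', '<s>'] instead of just ['<s>']
print(sp.encode("<s>", out_type=int))    # [31704, 2]
print([hex(ord(c)) for c in pieces[0]])  # the code point of the "▁"-looking character
```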
I need help understanding this behaviour; I believe there is something I am not doing right.