Good day friends, I'm new to this forum. I am working on a transformer project for language translation, but I do have some concerns regarding the tokenizer I am using.
I am using the SentencePiece library to train a tokenizer from scratch on my corpus (a low-resource language), but I notice that special tokens are encoded with an extra token in front of them.
My special tokens were defined and assigned unique IDs: padding (<pad>) = 0, UNK (<unk>) = 1, BOS (<s>) = 2, and EOS (</s>) = 3.
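For reference, my training call looks roughly like this (the corpus path, model prefix, vocab size and the <sep> symbol are placeholders rather than my exact settings):

```python
import sentencepiece as spm

# Rough sketch of my training setup (paths and vocab_size are placeholders)
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # my low-resource corpus
    model_prefix="spm_model",
    vocab_size=32000,
    pad_id=0,                      # <pad>
    unk_id=1,                      # <unk>
    bos_id=2,                      # <s>
    eos_id=3,                      # </s>
    user_defined_symbols="<sep>",  # placeholder for the extra symbols I add (mentioned further down)
)
```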
After training, I realised that encoding the <s> token returns [31704, 2], and decoding that gives the token back. Decoding just [2] also returns the same token, while decoding just [31704] returns an empty string.
This was the behaviour for all of my other special tokens as well: I get [31704, my_assigned_id], whereas I expected only [my_assigned_id] for my special tokens.
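To make it concrete, this is roughly what I run and what I observe (the ID 31704 is of course specific to my trained model):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

print(sp.encode("<s>", out_type=int))  # I get [31704, 2] instead of just [2]
print(sp.decode([31704, 2]))           # gives me "<s>" back
print(sp.decode([2]))                  # also gives "<s>"
print(sp.decode([31704]))              # gives an empty string
```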
I'd also like to mention that my <unk> token was actually decoded as a subword; the other special tokens would have been decoded as subwords as well, except that I added them to the list of user_defined_symbols (SentencePiece does not allow adding the UNK token to either control_symbols or user_defined_symbols).
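In case it helps, this is a sketch of how I have been inspecting the special pieces:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

# Check how each special token is represented in the trained model
for piece in ["<pad>", "<unk>", "<s>", "</s>"]:
    pid = sp.piece_to_id(piece)
    print(piece, pid, sp.is_control(pid))  # whether SentencePiece treats it as a control symbol

print(sp.id_to_piece(31704))  # the mystery piece that keeps showing up
```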
Now the main problem with this extra token in front of my special tokens is that it would cause the decoder input and output to be mismatched. For example, suppose I had:
["I like bread"]: my decoder input should be ["<s> I like bread"], encoded as [2, 56, 67, 100], and the target output ["I like bread </s>"], encoded as [56, 67, 100, 3], where 2 and 3 represent the <s> and </s> tokens respectively.
On the contrary, the encoded values are [31704, 2, 56, 67, 100] and [56, 67, 100, 31704, 3]. Now, at the step where the model is supposed to take in the <s> token with "I" as the target, it would take in 31704 and the target would be 56 for "I". In the second step it would take in 2, the actual <s> token in my opinion, and the target would be 67 for "like". This would continue until the model takes in "like" and has "" (empty string, token 31704) as the target, and finally takes in "bread" and has </s> as the final target.
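In code, I build the teacher-forcing pairs roughly like this (simplified; the IDs in the comments are the ones from my example above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

src = "I like bread"
decoder_input = sp.encode("<s> " + src, out_type=int)   # I get [31704, 2, 56, 67, 100]
target        = sp.encode(src + " </s>", out_type=int)  # I get [56, 67, 100, 31704, 3]

# Teacher forcing pairs them position by position, so everything is shifted by the stray 31704:
for inp, tgt in zip(decoder_input, target):
    print(inp, "->", tgt)  # 31704 -> 56, 2 -> 67, 56 -> 100, 67 -> 31704, 100 -> 3
```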
This mismatch caused by the 31704 token seems like a problem to me, and my solution was to filter out all 31704 tokens; however, I realised that the reconstructed (decoded) text may then sometimes lack proper spacing around the special tokens.
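My current workaround is roughly the following, and this is where the spacing problem shows up (again just a sketch):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

EXTRA_ID = 31704  # the stray piece that appears before my special tokens

ids = sp.encode("<s> I like bread </s>", out_type=int)
filtered = [i for i in ids if i != EXTRA_ID]  # drop every 31704
print(sp.decode(filtered))  # sometimes the spacing around the special tokens comes out wrong
```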
I'd also like to note that, depending on the parameter settings during SentencePiece training, the special token 31704 was decoded to "▁", yet it is not a hyphen in itself, nor does it seem to be the real space character, which is rendered with "▁" as well. So encoding <s> with out_type=str returns ["▁", "<s>"], or [31704, 2] as integers. It is the same for the other special tokens: just a "▁" before the actual token.
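This is a sketch of how I have been inspecting that character; printing the code point is just my way of checking that it is not a plain hyphen or a normal space:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_model.model")

pieces = sp.encode("<s>", out_type=str)
print(pieces)                            # I get ['▁', '<s>'] instead of just ['<s>']
print(sp.encode("<s>", out_type=int))    # [31704, 2]
print([hex(ord(c)) for c in pieces[0]])  # the code point of the "▁"-looking character
```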
I need help understanding this behaviour; I believe there is something I am not doing right.