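Hey! This happens because SentencePiece uses a special character, ▁ (U+2581, which looks like an underscore), to mark whitespace or the beginning of a word. You can check here for more about how the SentencePiece algorithm works.
In the case of "token", those are actually two different tokens in string form, one with the marker and one without, and you can see that by converting the ids to tokens. When you use the .decode()
method, it drops the ▁ markers (turning them back into plain whitespace), which is why both ids look identical after decoding.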
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('TeamUNIVA/Komodo_7B_v1.0.0')
>>> tokenizer.convert_ids_to_tokens(6029)
'▁token'
>>> tokenizer.convert_ids_to_tokens(5263)
'token'
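
To make the effect of decoding concrete, here is a minimal toy sketch of what happens to the marker. It does not call the real tokenizer: `toy_decode` is a hypothetical helper written for illustration, and it only approximates the library's behavior, on the assumption that decoding maps ▁ back to a space and strips the leading one.

```python
# Toy illustration only: NOT the transformers / SentencePiece implementation.
SPIECE_UNDERLINE = "\u2581"  # the "▁" character SentencePiece uses as a whitespace marker


def toy_decode(tokens):
    """Join token strings and turn the whitespace marker back into spaces.

    Assumption: the leading space produced by a word-initial marker is stripped,
    which is why '▁token' and 'token' end up looking the same.
    """
    text = "".join(tokens).replace(SPIECE_UNDERLINE, " ")
    return text.lstrip(" ")


# Two distinct token strings...
with_marker = SPIECE_UNDERLINE + "token"
without_marker = "token"
print(with_marker == without_marker)  # False

# ...that become indistinguishable once the marker is dropped.
print(toy_decode([with_marker]) == toy_decode([without_marker]))  # True
```

So to tell the two ids apart, compare the output of convert_ids_to_tokens rather than the decoded strings.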