Hi, I have trained a tokenizer using the BPE model with a ByteLevel pre-tokenizer:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
```
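(The training step itself looked roughly like this; the corpus file and trainer settings below are placeholders, not my exact ones.)

```python
from tokenizers import trainers

# Placeholder trainer settings; my real vocab size and corpus differ
trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")  # vocabulary is stored in the byte-level alphabet
```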
Now my vocabulary is saved as byte-level strings, and `tokenizer.tokenize` gives output in the same form, which is expected. For example, the output for a Nepali sentence is:
```
['Ġन', 'à¥ĩ', 'प', 'ा', 'ल', 'à¥Ģ', 'Ġà¤Ń', 'ा', 'ष', 'ा',
 'म', 'ा', 'Ġय', 'à¥ĭ', 'Ġà¤ıà¤ķ', 'Ġà¤īद', 'ा', 'हरण', 'Ġह', 'à¥ĭ।']
```
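To be clear, decoding still recovers the original Unicode text; it is only the per-token view and the saved vocabulary that use the byte-level alphabet. A minimal check (the sentence here is a stand-in, not my exact input):

```python
enc = tokenizer.encode("नेपाली भाषा")
print(enc.tokens)                 # byte-level strings like 'Ġà¤¨', 'à¥ĩ', ...
print(tokenizer.decode(enc.ids))  # original text back (modulo a leading space
                                  # from ByteLevel's add_prefix_space default)
```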
Is there a way to save my vocabulary as Unicode characters rather than byte-level strings, and to show the tokens as Unicode characters too?
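In case it clarifies what I am after: the byte-level strings can be mapped back by inverting the GPT-2-style byte-to-unicode table, but I would prefer the vocabulary itself to be stored as Unicode. A sketch (the helper names are my own, not a tokenizers API, and it assumes the token really is a byte-level string):

```python
def bytes_to_unicode():
    # GPT-2 mapping: printable bytes map to themselves, the rest shift to 256 + n
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the table: mapped character -> original byte value
u2b = {u: b for b, u in bytes_to_unicode().items()}

def readable(token: str) -> str:
    # Map each character back to its byte, then decode the bytes as UTF-8
    return bytes(u2b[ch] for ch in token).decode("utf-8", errors="replace")

print(readable("Ġà¤ıà¤ķ"))  # -> ' एक'
```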