['ĠDrive', 'Ġwas', 'Ġhad', 'Ġwalked', "'s", ',', 'Ġlooked', ...]
I need the tokens without the special characters.
decode does not work, because it only returns a single string.
Is there a function that outputs the plain tokens as a list?
Hey! Not sure I completely understand, but the tokens you have here are the
plain tokens, exactly as they appear in the vocab / merges. You would need to modify the tokenizer if you do not want it to add the
spiece token at the beginning. Which tokenizer are you using?
Thanks for the ping!
I was using the GPT byte level tokenizer.
I’m not sure if this is a hack, but to get the behavior I wanted, I just passed the token ids into
decode_batch instead, and that returned each token without the odd encoding.
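For anyone landing here later, this is a minimal sketch of why decoding token by token gives readable output. It assumes the tokenizer uses the GPT-2 style byte-to-unicode table, where every byte is mapped to a printable character (so a leading space shows up as `Ġ`). The helper names `bytes_to_unicode` and `plain_tokens` are just for illustration and are not part of the tokenizers API; the sketch only reverses that mapping for each token string.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2 style byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte (space, control characters, ...) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Reverse table: printable character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def plain_tokens(tokens):
    """Map each byte-level token string back to plain text, one string per token."""
    return [
        # A token that is only part of a multi-byte character decodes to U+FFFD.
        bytes(BYTE_DECODER[ch] for ch in tok).decode("utf-8", errors="replace")
        for tok in tokens
    ]


print(plain_tokens(["ĠDrive", "Ġwas", "'s", ","]))
# [' Drive', ' was', "'s", ',']
```

With the library itself, the equivalent is probably something like wrapping each id in its own list before calling decode_batch, so that every token is decoded as a separate sequence. That wrapping is my assumption about what was done above, since the exact call was not posted.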
It’s not a hack, but something I wish to improve! IMO
decode should be merged into one as we only have