Decode token IDs into a list (not a single string)

tokenizer.convert_ids_to_tokens returns:

['ĠDrive', 'Ġwas', 'Ġhad', 'Ġwalked', "'s", ',', 'Ġlooked', ...]

I need the tokens without the special characters. decode does not work, because it only returns a single string.

Is there a function that outputs the plain tokens as a list?

Hey! Not sure I completely understand, but the tokens that you have here are the plain tokens, as they are in the vocab / merge. You should modify the tokenizer if you do not want it to add the spiece token at the beginning. Which tokenizer are you using?

Thanks for the ping!

I was using the GPT byte level tokenizer.
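For context on where the odd leading character comes from: GPT-2 style byte-level BPE maps every byte to a printable character before merging, and the space byte ends up as `Ġ`. Below is a minimal standard-library sketch of that table and its inverse. The helper name `plain_tokens` is made up for illustration and is not part of any library; the table construction follows the published GPT-2 encoder as I understand it.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2 byte-level BPE."""
    # Bytes that are already printable keep their own character.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Every remaining byte (space, control bytes, ...) is shifted above U+00FF.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: printable character -> original byte.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def plain_tokens(tokens):
    """Hypothetical helper: undo the byte-level mapping token by token."""
    out = []
    for tok in tokens:
        raw = bytes(BYTE_DECODER[ch] for ch in tok)
        # A single token may hold only part of a multi-byte character,
        # so invalid sequences are replaced instead of raising.
        out.append(raw.decode("utf-8", errors="replace"))
    return out


print(plain_tokens(["ĠDrive", "Ġwas", "'s", ","]))
# [' Drive', ' was', "'s", ',']
```

So `Ġ` is just the encoded form of a leading space, which is why the tokens look "special" even though they are exactly what is stored in the vocab.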

I’m not sure if this is a hack, but to get the behavior I wanted, I just passed the token ids into batch_decode instead of decode. It treats each id as its own sequence, so it returned every token as a separate string without the odd encoding.
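In case it helps someone else, the same idea can be written out explicitly as a loop that decodes one id at a time. This is only a sketch: `decode_each` is a name I made up, and it assumes nothing more than a tokenizer object whose `decode` accepts a list of ids, as the transformers tokenizers do.

```python
def decode_each(tokenizer, token_ids):
    """Decode every id on its own so the result stays a list of strings."""
    return [tokenizer.decode([token_id]) for token_id in token_ids]


# Possible usage with transformers (the "gpt2" checkpoint is an assumption,
# any byte-level BPE tokenizer should behave the same way):
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("gpt2")
#   ids = tok("Drive was")["input_ids"]
#   decode_each(tok, ids)
```

One caveat: a token that contains only part of a multi-byte character cannot be decoded cleanly on its own, so per-token output may contain replacement characters for some non-ASCII text.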

It’s not a hack, but something I wish to improve! IMO batch_decode and decode should be merged into one, as we only have a single encode.