Decode token IDs into a list (not a single string)

steventrouble · June 12, 2023, 10:58pm

tokenizer.convert_ids_to_tokens returns:

['ĠDrive', 'Ġwas', 'Ġhad', 'Ġwalked', "'s", ',', 'Ġlooked', ...]

I need the tokens as plain text, without the byte-level marker characters such as Ġ. decode does not help here, because it returns a single string rather than one string per token.

Is there a function that outputs the plain tokens as a list?

ArthurZ · June 22, 2023, 7:11am

Hey! Not sure I completely understand, but the tokens that you have here are the plain tokens, as they are in the vocab / merge. You should modify the tokenizer if you do not want it to add the spiece token at the beginning. Which tokenizer are you using?
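
For context on why the vocab entries look like this: in GPT-2 style byte-level BPE, every byte is stored as a printable character, and the space byte ends up as Ġ (U+0120). The sketch below rebuilds that table in plain Python and reverses it for the tokens from the question. It assumes the tokenizer uses GPT-2's published byte mapping; `token_to_text` is a hypothetical helper name, not a library function.

```python
def bytes_to_unicode():
    """Byte -> printable-character table, following GPT-2's byte-level BPE scheme."""
    # Bytes that are already printable are kept as-is.
    bs = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (space, control characters, ...) are shifted past U+00FF.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Reverse table: printable character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_text(token):
    """Hypothetical helper: turn one vocab token back into readable text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    # A single token may hold only part of a multi-byte character,
    # so invalid sequences are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")


tokens = ["ĠDrive", "Ġwas", "Ġhad", "Ġwalked", "'s", ",", "Ġlooked"]
plain = [token_to_text(t) for t in tokens]
print(plain)  # [' Drive', ' was', ' had', ' walked', "'s", ',', ' looked']
```

The leading spaces are part of each token; call `.strip()` on the results if they are not wanted.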

steventrouble · June 23, 2023, 3:40am

Thanks for the ping!

I was using the GPT byte level tokenizer.

I’m not sure if this is a hack, but to get the behavior I wanted, I just passed the token ids into decode_batch instead, and that returned each token without the odd encoding.
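
A minimal sketch of that workaround, assuming a transformers tokenizer, where the method is named `batch_decode` (on a `tokenizers.Tokenizer` the equivalent is `decode_batch`). `decode_each` is a hypothetical helper name. Each id is wrapped in its own list, so every token is decoded as a separate one-element sequence.

```python
def decode_each(tokenizer, token_ids):
    """Decode every id on its own and return a list with one string per token."""
    # Each id becomes a one-element sequence, so the batch call returns
    # one decoded string per token instead of a single joined string.
    return tokenizer.batch_decode([[i] for i in token_ids])


# Usage sketch (assumes transformers is installed and the "gpt2" checkpoint is available):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# ids = tok("Drive was had walked")["input_ids"]
# print(decode_each(tok, ids))
```

One caveat: a byte-level token can contain only part of a multi-byte character, in which case decoding it on its own may produce the replacement character '�'.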

ArthurZ · September 18, 2023, 9:17pm

It’s not a hack, but something I wish to improve! IMO batch_decode and decode should be merged into one, as we only have encode.

lone17 · March 11, 2025, 8:53pm

Wow thank you ! Faced this today and this “hack” saved me. Btw after 2 years it’s still just a “hack” haha
