Decoding sequence of tokens produces question marks instead of actual tokens

souryadey · September 3, 2024, 10:27am

I am working with the mistralai/Mistral-7B-v0.1 model. I loaded the tokenizer via:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')

then ran the following code:

tokenizer.decode([69])

Produces output B.

tokenizer.decode([198])

Produces output �. This is perhaps understandable, since token ID 198 corresponds to the token <0xC3>: the byte 0xC3 is the Latin-1 (not ASCII) code for Ã and, on its own, is not a complete UTF-8 sequence, so it may be unprintable.
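For reference, here is a minimal sketch in plain Python (no tokenizer involved) of how a lone 0xC3 byte behaves under different decodings. The variable names are illustrative only.

```python
# A lone 0xC3 byte, which is what the token <0xC3> stands for.
lone = bytes([0xC3])

# Interpreted as Latin-1, the byte is the character Ã.
as_latin1 = lone.decode("latin-1")

# In UTF-8, 0xC3 only starts a two-byte sequence, so on its own it is
# invalid and is replaced by U+FFFD (the � character).
as_utf8 = lone.decode("utf-8", errors="replace")

# Followed by a continuation byte, it forms a valid character (here é).
completed = bytes([0xC3, 0xA9]).decode("utf-8")

print(repr(as_latin1), repr(as_utf8), repr(completed))
```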

But then:

tokenizer.decode([69,198])

Produces output ��. I don’t know why it’s producing this instead of B�.
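One possible explanation, sketched in plain Python: suppose both IDs are byte-fallback tokens (198 is <0xC3> as noted above; that 69 is <0x42> is an assumption, inferred from the same offset of 3). Suppose also that the decoder joins consecutive byte tokens, tries to decode the whole run as UTF-8, and emits one � per byte when that fails. Then B and 0xC3 would fail together as one run. The function `decode_byte_run` is hypothetical and only models this assumed behaviour; it is not the tokenizers library's code.

```python
def decode_byte_run(byte_values):
    """Hypothetical model: decode one run of consecutive byte tokens."""
    raw = bytes(byte_values)
    try:
        # The whole run must be valid UTF-8 to be decoded as text.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise every byte in the run becomes a replacement character.
        return "\ufffd" * len(raw)


print(decode_byte_run([0x42]))              # the single byte for B
print(decode_byte_run([0xC3]))              # incomplete UTF-8 sequence
print(decode_byte_run([0x42, 0xC3]))        # run fails as a whole
print(decode_byte_run([0x42, 0xC3, 0xA9]))  # run completed by 0xA9
```

If this model holds, tokenizer.convert_ids_to_tokens([69, 198]) should show two byte tokens of the form <0x..>, which would be a quick way to check the assumption.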

Any help will be appreciated!

souryadey · September 3, 2024, 10:28am

Additional details: My Python environment has tokenizers==0.19.1 and transformers==4.44.2.
