I am working with the mistralai/Mistral-7B-v0.1 model. I loaded the tokenizer via:
tokenizer = transformers.AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')
then ran the following code:
tokenizer.decode([69])
Produces output B.
tokenizer.decode([198])
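Produces output �. This is perhaps understandable, since token ID 198 corresponds to the byte token <0xC3>. The byte 0xC3 is not a complete character on its own: it is the first byte of a two-byte UTF-8 sequence (for example, à is encoded as 0xC3 0xA0), so by itself it is unprintable.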
But then:
tokenizer.decode([69,198])
Produces output ��. I expected B�, since token 69 decodes to B on its own, and I don’t know why the B is replaced as well.
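For comparison, here is a minimal sketch of what I expected, written in plain Python rather than with the tokenizer. It assumes that token ID 69 stands for the single byte 0x42 (the letter B) and token ID 198 for the byte 0xC3; the variable names are only for illustration.

```python
# Assumption: ID 69 maps to the byte 0x42 ("B") and ID 198 to the byte 0xC3.
raw = bytes([0x42, 0xC3])

# Lenient UTF-8 decoding: each invalid byte becomes U+FFFD, valid bytes are kept.
decoded = raw.decode("utf-8", errors="replace")
print(repr(decoded))

# The lone byte 0xC3 is an incomplete UTF-8 sequence, so it becomes U+FFFD.
lone = bytes([0xC3]).decode("utf-8", errors="replace")
print(repr(lone))

# Followed by 0xA0, the same byte completes the two-byte encoding of "à".
complete = bytes([0xC3, 0xA0]).decode("utf-8")
print(repr(complete))
```

With this kind of lenient decoding, only the invalid byte is replaced and the B survives, which is the behavior I expected from tokenizer.decode.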
Any help will be appreciated!