Llama2 tokenizer duplicate ids

Hi community,

I was hoping someone could help explain why two different ids both decode to the same token “A” in the code below:

```python
import transformers

llm = 'meta-llama/Llama-2-7b-chat-hf'

tokenizer = transformers.AutoTokenizer.from_pretrained(
    llm,
    use_auth_token=True,
)

print('id 29909:', tokenizer.decode([29909]))
print('id 319:', tokenizer.decode([319]))
```

Thanks!

It’s because the first one is just an A, while the other one is an A with a prefix space (▁A).

You can see this by converting the ids back to tokens:

```python
>>> tokenizer.convert_ids_to_tokens([29909])
['A']
>>> tokenizer.convert_ids_to_tokens([319])
['▁A']
```

Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens; they’re joined onto the tokens that follow them. For example, if you tokenize my previous two sentences, you can see this in action:

```python
>>> sentences = "Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens, they're joined with other tokens."
>>> tokenizer.tokenize(sentences)
['▁Many', '▁tokens', '▁have', '▁prefix', '▁spaces', '▁attached', '▁to', '▁them', '.', '▁With', '▁transform', 'er', '▁token', 'izers', ',', '▁spaces', '▁are', '▁usually', '▁not', '▁their', '▁own', '▁tokens', ',', '▁they', "'", 're', '▁joined', '▁with', '▁other', '▁tokens', '.']
```

But when you decode a token starting with a prefix space, the space is usually stripped out for readability, giving the appearance that it isn’t there.
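To make that concrete, here is a minimal sketch of SentencePiece-style detokenization (not the actual `transformers` implementation, just the idea): the `▁` marker is replaced with a space, and a leading space is stripped from the final string, which is why both ids render as a bare `A`:

```python
# Hypothetical sketch of SentencePiece-style decoding, for illustration only.
# "\u2581" is the "▁" word-boundary marker used by the Llama tokenizer.
def sketch_decode(tokens):
    text = "".join(tokens).replace("\u2581", " ")
    return text.lstrip(" ")  # leading space stripped for readability

print(sketch_decode(["A"]))                  # A
print(sketch_decode(["▁A"]))                 # A  (prefix space stripped)
print(sketch_decode(["▁Many", "▁tokens"]))   # Many tokens
```

So `'A'` and `'▁A'` are genuinely different tokens in the vocabulary; they only look identical after decoding because the leading space is dropped.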