Llama2 tokenizer duplicate ids

Hi community,

I was hoping someone could help explain why two different ids are both decoded to the same token “A” in the code below?

import transformers

llm = 'meta-llama/Llama-2-7b-chat-hf'

tokenizer = transformers.AutoTokenizer.from_pretrained(
    llm,
    use_auth_token=True,
)

print('id 29909:', tokenizer.decode([29909]))
print('id 319:', tokenizer.decode([319]))

Thanks!

It’s because the first one is just an A, while the other one is an A with a prefix space (▁A).

You can see the difference like so:

>>> tokenizer.convert_ids_to_tokens([29909])
['A']
>>> tokenizer.convert_ids_to_tokens([319])
['▁A']

Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens, they're joined with other tokens. For example, if you tokenize my previous two sentences, you can see this in action:

>>> sentences = "Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens, they're joined with other tokens."
>>> tokenizer.tokenize(sentences)
['▁Many', '▁tokens', '▁have', '▁prefix', '▁spaces', '▁attached', '▁to', '▁them', '.', '▁With', '▁transform', 'er', '▁token', 'izers', ',', '▁spaces', '▁are', '▁usually', '▁not', '▁their', '▁own', '▁tokens', ',', '▁they', "'", 're', '▁joined', '▁with', '▁other', '▁tokens', '.']

But when you decode a token with a prefix space on its own (or at the start of a sequence), the leading space is usually stripped out for readability, giving the appearance that it isn’t there.
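To make that concrete, here is a minimal sketch with the same tokenizer loaded above (the outputs shown are what I’d expect from the SentencePiece-style decoder, so worth double-checking on your end): the space belonging to ▁A survives when the token follows another token, and only disappears when it would be the very first character of the decoded string.

>>> tokenizer.decode([29909, 319])  # 'A' followed by '▁A': the space survives
'A A'
>>> tokenizer.decode([319])  # '▁A' on its own: the leading space is stripped
'A'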


@dblakely I am working on extending the Llama tokenizer to new languages, some of which may include English in romanised script, and there is a lot of code-mixed data available. What if my extended tokenizer contains a few vocabulary entries that already exist in the original tokenizer? Will this be an issue, or are tokenizers good enough to handle such cases?
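For example, is something like this rough sketch enough? The candidate list here is made up, and I check the overlap against the base vocabulary before calling add_tokens, which (if I understand correctly) skips strings that already have an id and returns only the count actually added:

# Hypothetical extension pieces: "A" already exists in the base vocab
# (id 29909 from the discussion above); the Devanagari example presumably doesn't.
candidate_tokens = ["A", "नमस्ते"]

existing_vocab = tokenizer.get_vocab()             # dict: token string -> id
overlap = [t for t in candidate_tokens if t in existing_vocab]
print("already in vocab:", overlap)                # expected: ['A']

# add_tokens should skip known strings rather than create duplicate ids
num_added = tokenizer.add_tokens(candidate_tokens)
print("newly added:", num_added)                   # expected: 1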