Llama2 tokenizer duplicate ids

Hi community,

I was hoping someone could help explain why two different ids are both decoded to the same token “A” in the code below?

import transformers

llm = 'meta-llama/Llama-2-7b-chat-hf'

tokenizer = transformers.AutoTokenizer.from_pretrained(
    llm,
    use_auth_token=True,
)

print('id 29909:', tokenizer.decode([29909]))
print('id 319:', tokenizer.decode([319]))

Thanks!

It’s because the first one is just an A, while the other one is an A with a prefix space (▁A).

You can see the difference like so:

>>> tokenizer.convert_ids_to_tokens([29909])
['A']
>>> tokenizer.convert_ids_to_tokens([319])
['▁A']

Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens, they're joined with other tokens. For example, if you tokenize my previous two sentences, you can see this in action:

>>> sentences = "Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens, they're joined with other tokens."
>>> tokenizer.tokenize(sentences)
['▁Many', '▁tokens', '▁have', '▁prefix', '▁spaces', '▁attached', '▁to', '▁them', '.', '▁With', '▁transform', 'er', '▁token', 'izers', ',', '▁spaces', '▁are', '▁usually', '▁not', '▁their', '▁own', '▁tokens', ',', '▁they', "'", 're', '▁joined', '▁with', '▁other', '▁tokens', '.']

But when you decode a token with a prefix space on its own (or at the start of a sequence), the leading space is usually stripped out for readability, giving the appearance that it isn’t there.
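To make that concrete, here is a minimal sketch with the same tokenizer loaded above (the outputs shown are what I’d expect from the SentencePiece-style decoder, so worth double-checking on your end): the space belonging to ▁A survives when the token follows another token, and only disappears when it would be the very first character of the decoded string.

>>> tokenizer.decode([29909, 319])  # 'A' followed by '▁A': the space survives
'A A'
>>> tokenizer.decode([319])  # '▁A' on its own: the leading space is stripped
'A'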


@dblakely I am working on extending the Llama tokenizer to new languages, some of which may include English in romanised script, and there is a lot of code-mixed data available. What if my extended tokenizer contains a few vocabulary entries that already exist in the original tokenizer? Will this be an issue, or are tokenizers good enough to handle such cases?
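For example, is something like this rough sketch enough? The candidate list here is made up, and I check the overlap against the base vocabulary before calling add_tokens, which (if I understand correctly) skips strings that already have an id and returns only the count actually added:

# Hypothetical extension pieces: "A" already exists in the base vocab
# (id 29909 from the discussion above); the Devanagari example presumably doesn't.
candidate_tokens = ["A", "नमस्ते"]

existing_vocab = tokenizer.get_vocab()             # dict: token string -> id
overlap = [t for t in candidate_tokens if t in existing_vocab]
print("already in vocab:", overlap)                # expected: ['A']

# add_tokens should skip known strings rather than create duplicate ids
num_added = tokenizer.add_tokens(candidate_tokens)
print("newly added:", num_added)                   # expected: 1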