It’s because the first one is just an A, while the other one is an A with a prefix space (▁A).
You can see the difference like this:
>>> tokenizer.convert_ids_to_tokens([29909])
['A']
>>> tokenizer.convert_ids_to_tokens([319])
['▁A']
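This mapping also explains which ID you get back when you encode a bare "A": SentencePiece-style tokenizers (the ▁ marker suggests this is one) insert a prefix space at the start of the text, so the standalone letter becomes the ▁A token. A quick check, assuming the same tokenizer as above (the IDs are specific to this vocabulary):
>>> tokenizer.encode("A", add_special_tokens=False)  # prefix space added, so you get ▁A
[319]
>>> tokenizer.convert_tokens_to_ids("A")  # the bare 'A' piece, as seen mid-word
29909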
Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens; they’re joined onto the token that follows them. For example, if you tokenize my previous two sentences, you can see this in action:
>>> sentences = "Many tokens have prefix spaces attached to them. With transformer tokenizers, spaces are usually not their own tokens, they're joined with other tokens."
>>> tokenizer.tokenize(sentences)
['▁Many', '▁tokens', '▁have', '▁prefix', '▁spaces', '▁attached', '▁to', '▁them', '.', '▁With', '▁transform', 'er', '▁token', 'izers', ',', '▁spaces', '▁are', '▁usually', '▁not', '▁their', '▁own', '▁tokens', ',', '▁they', "'", 're', '▁joined', '▁with', '▁other', '▁tokens', '.']
But when you decode a token starting with a prefix space, the space is usually stripped out for readability, giving the appearance that it isn’t there.
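For example, with the same tokenizer, both IDs decode to what looks like the same string (a quick sketch; the exact behaviour can vary a little between tokenizers and library versions):
>>> tokenizer.decode([29909])  # plain 'A'
'A'
>>> tokenizer.decode([319])    # '▁A', but the leading space is stripped on decode
'A'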