Difference between tokenizer and convert_tokens_to_ids

I was trying to convert tokens that contain spaces into ids and realized that I don’t get the same result when I use convert_tokens_to_ids as when I call the tokenizer directly. Shouldn’t both map to the same ids?
Thank you for your help.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    use_fast=False,
)

# This returns None for both
tokenizer.convert_tokens_to_ids([" ", " ,"])
# This returns the correct ids (220 and 1174)
tokenizer([" ", " ,"], add_special_tokens=False).input_ids
# To get the right ids I need to replace spaces with Ġ; shouldn't this be handled by the convert_tokens_to_ids method?
tokenizer.convert_tokens_to_ids(["Ġ", "Ġ,"])