Difference between tokenizer and convert_tokens_to_ids

I was trying to convert tokens that contain spaces into ids and realized that I don’t get the same result when I use convert_tokens_to_ids as when I call the tokenizer directly. Shouldn’t both map to the same ids?
Thank you for your help.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    use_fast=False,
)

# This returns None for both
tokenizer.convert_tokens_to_ids([" ", " ,"])
# This returns the correct ids (220 and 1174)
tokenizer([" ", " ,"], add_special_tokens=False).input_ids
# To get the right ids I need to replace spaces with Ġ; shouldn't this be handled by the convert_tokens_to_ids method?
tokenizer.convert_tokens_to_ids(["Ġ", "Ġ,"])