I want to add a new token to the Gemma tokenizer, let's call it `<newtoken>`. I would like to take advantage of the fact that the vocabulary of this tokenizer has unused tokens (`<unused0>`, `<unused1>`, …).

The first thing I tried was `tokenizer.add_tokens("<newtoken>")`, but this only appends the token to the end of the vocabulary instead of replacing the unused ones.

Is there an expected way to replace the unused tokens? Something like `add_tokens("<newtoken>", replace_unused=True)`?
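For reference, here is a minimal sketch of what I tried (assuming the `google/gemma-2b` checkpoint, which is gated on the Hub; the token just gets appended with a brand-new id):

```python
from transformers import AutoTokenizer

# Assuming the google/gemma-2b checkpoint (requires Hub access approval)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
print(len(tokenizer))  # original vocabulary size

# add_tokens() appends the token instead of reusing an unused slot
tokenizer.add_tokens("<newtoken>")
print(len(tokenizer))                                 # grew by one
print(tokenizer.convert_tokens_to_ids("<newtoken>"))  # id beyond the original vocab
```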
I just asked Bing Copilot and it gave me this code snippet:
```python
from transformers import AutoTokenizer

# Load the slow tokenizer: its .vocab is a plain dict that can be
# mutated in place (the fast tokenizer's .vocab returns a copy)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# Define the mapping of old (unused) tokens to new tokens
token_mapping = {
    "<unused0>": "<NEW_TOKEN1>",
    "<unused1>": "<NEW_TOKEN2>",
}

# Update the tokenizer's vocabulary in place, reusing the old token ids
for old_token, new_token in token_mapping.items():
    if old_token in tokenizer.get_vocab():
        token_id = tokenizer.vocab.pop(old_token)
        tokenizer.vocab[new_token] = token_id
        tokenizer.ids_to_tokens[token_id] = new_token

# Save the updated tokenizer
tokenizer.save_pretrained("./updated_tokenizer")
```
I haven’t tested it and can’t confirm it works.
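One way to sanity-check it would be to reload the saved tokenizer and confirm that the new token resolves to the id the unused token used to occupy (untested sketch, reusing the names from the snippet above):

```python
from transformers import AutoTokenizer

reloaded = AutoTokenizer.from_pretrained("./updated_tokenizer", use_fast=False)

# The new token should map to the old <unused0> id,
# and the vocabulary size should be unchanged
print(reloaded.convert_tokens_to_ids("<NEW_TOKEN1>"))
print(len(reloaded))
```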
There is also a related GitHub issue: how to replace the existing token in a tokenizer · Issue #27974 · huggingface/transformers
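For fast tokenizers, whose `.vocab` property returns a copy rather than the underlying table, a workaround in the same spirit is to edit the saved `tokenizer.json` directly. This is only a sketch under the assumption that the file stores its vocabulary under `model.vocab`, as BPE and WordPiece models do; the path and token names are placeholders:

```python
import json

# Hypothetical path to a saved fast tokenizer's tokenizer.json
path = "./my_tokenizer/tokenizer.json"

with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Reassign the unused slot's id to the new token (assumes a model.vocab layout)
vocab = data["model"]["vocab"]
vocab["<newtoken>"] = vocab.pop("<unused0>")

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```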