I want to add a new token to the Gemma tokenizer, let's call it `<newtoken>`. I would like to take advantage of the fact that the vocabulary of this tokenizer has unused tokens (`<unused0>`, `<unused1>`, …).

The first thing I tried was `tokenizer.add_tokens("<newtoken>")`, but this only appends the token to the end of the vocabulary instead of replacing the unused ones.

Is there an expected way to replace the unused tokens? Something like `add_tokens("<newtoken>", replace_unused=True)`?
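For reference, here is a minimal sketch of what I tried (assuming the `google/gemma-2b` checkpoint, which is gated on the Hub; the token just gets appended with a brand-new id):

```python
from transformers import AutoTokenizer

# Assuming the google/gemma-2b checkpoint (requires Hub access approval)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
print(len(tokenizer))  # original vocabulary size

# add_tokens() appends the token instead of reusing an unused slot
tokenizer.add_tokens("<newtoken>")
print(len(tokenizer))                                 # grew by one
print(tokenizer.convert_tokens_to_ids("<newtoken>"))  # id beyond the original vocab
```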
I just asked Bing Copilot and it gave me this code snippet:
```python
from transformers import AutoTokenizer

# Load the slow tokenizer: its .vocab is a plain dict that can be
# mutated in place (the fast tokenizer's .vocab returns a copy)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# Define the mapping of old (unused) tokens to new tokens
token_mapping = {
    "<unused0>": "<NEW_TOKEN1>",
    "<unused1>": "<NEW_TOKEN2>",
}

# Update the tokenizer's vocabulary in place, reusing the old token ids
for old_token, new_token in token_mapping.items():
    if old_token in tokenizer.get_vocab():
        token_id = tokenizer.vocab.pop(old_token)
        tokenizer.vocab[new_token] = token_id
        tokenizer.ids_to_tokens[token_id] = new_token

# Save the updated tokenizer
tokenizer.save_pretrained("./updated_tokenizer")
```
I haven’t tested it and can’t confirm it works.
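One way to sanity-check it would be to reload the saved tokenizer and confirm that the new token resolves to the id the unused token used to occupy (untested sketch, reusing the names from the snippet above):

```python
from transformers import AutoTokenizer

reloaded = AutoTokenizer.from_pretrained("./updated_tokenizer", use_fast=False)

# The new token should map to the old <unused0> id,
# and the vocabulary size should be unchanged
print(reloaded.convert_tokens_to_ids("<NEW_TOKEN1>"))
print(len(reloaded))
```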
There is also a related GitHub issue: how to replace the existing token in a tokenizer · Issue #27974 · huggingface/transformers
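For fast tokenizers, whose `.vocab` property returns a copy rather than the underlying table, a workaround in the same spirit is to edit the saved `tokenizer.json` directly. This is only a sketch under the assumption that the file stores its vocabulary under `model.vocab`, as BPE and WordPiece models do; the path and token names are placeholders:

```python
import json

# Hypothetical path to a saved fast tokenizer's tokenizer.json
path = "./my_tokenizer/tokenizer.json"

with open(path, encoding="utf-8") as f:
    data = json.load(f)

# Reassign the unused slot's id to the new token (assumes a model.vocab layout)
vocab = data["model"]["vocab"]
vocab["<newtoken>"] = vocab.pop("<unused0>")

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
```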