Replace special [unusedX] tokens in a tokenizer to add domain-specific words

mkarlos · October 12, 2023, 9:45pm

Let’s say I have a domain-specific word that I want to add to the tokenizer I am using to fine-tune a model further. The BERT tokenizer is one of those tokenizers that has [unusedX] tokens. One way to add new tokens is to use the add_tokens or add_special_tokens method, e.g.:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer2 = tokenizer._tokenizer  # underlying tokenizers.Tokenizer backend
tokenizer2.add_special_tokens(["DomainSpecificWord"])
tokenizer2.encode("DomainSpecificWord").ids
# [101, 30522, 102]

However, this increases the size of the tokenizer's vocabulary, because the newly added word is assigned a new id. The BERT tokenizer has almost 1000 unused tokens that could be used for this purpose, but I haven’t found an example or documentation that shows how to do that.

P.S. I tried using
tokenizer.vocab['DomainSpecificWord'] = tokenizer.vocab.pop('[unused701]') but it didn’t work.
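
For illustration, here is a minimal sketch of one possible way to reuse the [unusedX] slots: edit a BERT-style vocab.txt directly, overwriting placeholder lines in place so that every id stays the same and the vocabulary does not grow. This is not an official API. The function name replace_unused_tokens, the file names, and the toy vocabulary are all hypothetical, and the sketch uses only the standard library. The dict assignment in the P.S. probably had no effect because, as far as I know, tokenizer.vocab on a fast tokenizer returns a fresh copy of the vocabulary each time, so mutating it never reaches the backend.

```python
import re
import tempfile
from pathlib import Path

# Matches BERT placeholder entries such as "[unused701]".
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def replace_unused_tokens(vocab_path, new_words, out_path):
    """Overwrite [unusedN] lines of a BERT-style vocab.txt with new_words.

    Line order (and so every token id) is preserved, which keeps the
    vocabulary size unchanged. Returns a dict {word: id} for the words
    that were actually placed. Hypothetical helper, not a library call.
    """
    lines = Path(vocab_path).read_text(encoding="utf-8").splitlines()
    existing = set(lines)
    # Line index == token id in a vocab.txt file.
    free_slots = [i for i, tok in enumerate(lines) if UNUSED_PATTERN.match(tok)]

    assigned = {}
    for word in new_words:
        if word in existing or word in assigned:
            continue  # word already has an id; nothing to do
        if not free_slots:
            raise ValueError("no [unusedN] slots left in the vocabulary")
        slot = free_slots.pop(0)
        lines[slot] = word
        assigned[word] = slot

    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return assigned


# Toy demonstration on a tiny made-up vocabulary (not the real BERT file).
toy_vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "[CLS]", "[SEP]", "hello"]
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "vocab.txt"
    dst = Path(tmp) / "vocab_domain.txt"
    src.write_text("\n".join(toy_vocab) + "\n", encoding="utf-8")
    # Lowercase on purpose: an uncased model lowercases input before lookup.
    print(replace_unused_tokens(src, ["domainspecificword"], dst))
```

The edited file could then presumably be loaded with something like BertTokenizerFast(vocab_file="vocab_domain.txt"); I have not verified this against every transformers version. Two caveats: for an uncased checkpoint the new entry should be lowercase, since the normalizer lowercases text before the WordPiece lookup, and the embedding row at the reused id still holds the weights of the [unused] placeholder, so the model has to learn the word during fine-tuning.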
