Let's say I have a domain-specific word that I want to add to the tokenizer I am using to fine-tune a model further. The BERT tokenizer is one of those tokenizers that ships with [unusedX] tokens. One way to add new tokens is the add_tokens or add_special_tokens method. For example:
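from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Access the underlying `tokenizers.Tokenizer` backend of the fast tokenizer
tokenizer2 = tokenizer._tokenizer
# Register the new word; it is appended to the end of the vocabulary
tokenizer2.add_special_tokens(["DomainSpecificWord"])
tokenizer2.encode("DomainSpecificWord").ids
# [101, 30522, 102]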
However, this increases the size of the tokenizer's vocabulary, because the newly added word is assigned a new id (30522 above). The BERT tokenizer has almost 1000 unused tokens that could be used for this purpose instead, but I haven't found an example or any documentation that shows how to do that.
P.S. I tried using
tokenizer.vocab['DomainSpecificWord'] = tokenizer.vocab.pop('[unused701]')
but it didn't work.
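For illustration, here is a minimal sketch of what I am trying to achieve, written against a toy vocabulary rather than the real tokenizer. It assumes the vocabulary is a list of tokens in which the list index is the token id (the layout of BERT's vocab.txt, one token per line). The helper name reuse_unused_slot and the toy vocabulary are made up for this example. The word is lowercased because bert-base-uncased lowercases its input before lookup.

```python
def reuse_unused_slot(vocab_lines, unused_token, new_word, lowercase=True):
    """Return a copy of vocab_lines with unused_token replaced by new_word.

    The list index is the token id, so the id of the slot and the
    vocabulary size both stay the same. (Hypothetical helper, not a
    transformers API.)
    """
    word = new_word.lower() if lowercase else new_word
    if word in vocab_lines:
        raise ValueError(f"{word!r} is already in the vocabulary")
    slot = vocab_lines.index(unused_token)  # raises ValueError if the slot is missing
    updated = list(vocab_lines)  # copy, so the original list is left untouched
    updated[slot] = word
    return updated


# Toy vocabulary standing in for the lines of a real vocab.txt.
toy_vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "[CLS]", "[SEP]", "hello"]
new_vocab = reuse_unused_slot(toy_vocab, "[unused1]", "DomainSpecificWord")

print(len(new_vocab) == len(toy_vocab))       # True
print(new_vocab.index("domainspecificword"))  # 2
```

With the real tokenizer, I assume the equivalent would be to edit the vocab.txt written by tokenizer.save_pretrained(...) in the same way and reload it with BertTokenizerFast(vocab_file=...), but I have not verified that this works across library versions.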