How can I remove unwanted sub-tokens from the GPT-2 vocabulary or tokenizer? I tried an existing approach that was used for a RoBERTa-style model, shown below (Removing tokens from the tokenizer · Issue #15032 · huggingface/transformers · GitHub). However, it fails when re-initializing the “model” component of the backend_tokenizer with the new vocabulary.
#1. Get your tokenizer and the list of tokens you want to remove
import json
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# tokens to remove from the vocabulary
unwanted_words = ['ply', 'Ġmor', 'Ġprovide', 'IC', 'ung', 'Ġparty', 'Ġexist', 'Ġmag']
#2. Get the arguments used to initialize the "model" component of the backend_tokenizer.
model_state = json.loads(tokenizer.backend_tokenizer.model.__getstate__())
print(len(model_state["vocab"]))
#3. Modify the initialization arguments, in particular the vocabulary to remove the tokens we don't want
# remove all unwanted tokens from the vocabulary
for word in unwanted_words:
    del model_state["vocab"][word]
print(len(model_state["vocab"]))
#4. Initialize the "model" component of the backend_tokenizer again with the new vocabulary
from tokenizers import models
model_class = getattr(models, model_state.pop("type"))
tokenizer.backend_tokenizer.model = model_class(**model_state)
print(len(tokenizer.vocab))
And below is the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-fa908d23c419> in <module>
30 model_class = getattr(models, model_state.pop("type"))
31
---> 32 tokenizer.backend_tokenizer.model = model_class(**model_state)
33
34 print(len(tokenizer.vocab))
TypeError: argument 'merges': failed to extract enum PyMerges ('Merges | Filename')
- variant Merges (Merges): TypeError: failed to extract field PyMerges::Merges.0, caused by TypeError: 'str' object cannot be converted to 'PyTuple'
- variant Filename (Filename): TypeError: failed to extract field PyMerges::Filename.0, caused by TypeError: 'list' object cannot be converted to 'PyString'
What other methods can I use or refer to? The script I adapted was written for a RoBERTa-style model, while GPT-2 uses BPE, and the failure appears to come from the BPE "merges" argument.
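One possible direction, as a minimal sketch rather than a confirmed fix: the traceback suggests the BPE constructor wants "merges" as a list of tuples (or a filename), while the serialized state holds each merge as a single string. The helper below (prune_bpe_state is a hypothetical name, not part of transformers or tokenizers) converts the merges to tuples and drops any merge that uses or produces a removed token. It assumes merges are serialized either as "left right" strings or as [left, right] pairs, and it runs on a toy state so it does not need the real GPT-2 files.

```python
def prune_bpe_state(model_state, unwanted_tokens):
    """Return a copy of a serialized BPE state without the unwanted tokens."""
    unwanted = set(unwanted_tokens)
    # Keep the original ids so remaining tokens still map to the same embedding rows.
    vocab = {tok: idx for tok, idx in model_state["vocab"].items() if tok not in unwanted}
    merges = []
    for merge in model_state["merges"]:
        # Assumption: a merge is either the string "left right" or a [left, right] pair.
        left, right = merge.split(" ") if isinstance(merge, str) else merge
        # Drop merges that consume or produce a token no longer in the vocabulary.
        if left in vocab and right in vocab and left + right in vocab:
            merges.append((left, right))
    pruned = dict(model_state)  # shallow copy; the input state is left untouched
    pruned["vocab"] = vocab
    pruned["merges"] = merges
    return pruned


# Toy state standing in for json.loads(tokenizer.backend_tokenizer.model.__getstate__())
toy_state = {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4},
    "merges": ["a b", "ab c"],
}
pruned_toy = prune_bpe_state(toy_state, ["abc"])
print(pruned_toy["vocab"])
print(pruned_toy["merges"])
```

With the real tokenizer, the pruned state would be passed back as in step #4 (pop "type", then model_class(**pruned)), assuming the remaining keys in the state are accepted as constructor arguments by the installed tokenizers version. Note that the vocabulary ids keep their gaps, which preserves alignment with the pretrained embeddings but means the vocabulary length no longer equals the largest id plus one.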