Hi there!
I need to remove specific tokens from my tokenizer’s vocabulary, and I am not quite sure how to do so. Specifically, I am using Qwen2Tokenizer, a BPE tokenizer, and I would like to remove specific Chinese tokens from its vocabulary. I have tried various methods, shown below, but to no avail.
Deleting Tokens from Vocabulary
# vocab was obtained beforehand, e.g. via vocab = tokz.get_vocab()
for tok in long_toks: vocab.pop(tok)
tokz.vocab = vocab
This results in unknown tokens being output, and the vocabulary lengths no longer match, i.e. `len(vocab) != len(tokz.vocab)`.
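For reference, this is roughly how I am checking the result (some_text here is just a stand-in for any string containing one of the removed tokens):

print(len(vocab), len(tokz.vocab))  # the two lengths no longer match
print(tokz.tokenize(some_text))     # unknown tokens show up in the output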
Tweaking vocab.json and merges.txt
vocab.json
import json
with open('tokenizer/vocab.json', 'r') as f: vocab = json.load(f)
for tok in long_toks: vocab.pop(tok)
with open('tokenizer/vocab.json', 'w') as f: json.dump(vocab, f, indent=2)
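After editing, I reload the tokenizer from the edited files, roughly like this (assuming the tokenizer files were saved to a local tokenizer/ folder beforehand):

from transformers import Qwen2Tokenizer
tokz = Qwen2Tokenizer.from_pretrained('tokenizer')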
merges.txt
from fastcore.foundation import L  # L is fastcore's list class
with open('tokenizer/merges.txt', 'r') as f: merges = L(f.readlines()[1:])
While I am able to update vocab.json, I do not know how to work with the contents of merges.txt. This file stores pairs as follows:
(#151387) ['Ġ Ġ\n','ĠĠ ĠĠ\n','i n\n','Ġ t\n','ĠĠĠĠ ĠĠĠĠ\n','e r\n','ĠĠ Ġ\n','o n\n','Ġ a\n','r e\n'...]
I am unable to determine which special characters represent Chinese characters. Is there any way I can decode or figure out what characters such as 'Ġ' and 'â½ Ĺ \n' represent?
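My current guess is that these are GPT-2-style byte-level characters, so something like the following might map a merges.txt entry back to readable text ('ä¸Ń' below is just an example token, and I am not sure this is the right approach):

from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

# invert the byte -> unicode-character mapping used by byte-level BPE
byte_decoder = {c: b for b, c in bytes_to_unicode().items()}

def decode_bpe_token(tok):
    # map each byte-level character back to its byte, then decode the bytes as UTF-8
    return bytes(byte_decoder[c] for c in tok).decode('utf-8', errors='replace')

print(repr(decode_bpe_token('Ġ')))   # should show it stands for a leading space
print(decode_bpe_token('ä¸Ń'))       # should decode to a Chinese character if my guess is right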
Training a Tokenizer
I feel training a tokenizer from scratch might not be feasible because of the amount of data it would require. In any case, I want to end up with the exact same Qwen2Tokenizer, save for certain tokens removed.
Backend
import json
# pull the BPE model's internal state (vocab and merges) out of the backend tokenizer
tokz_state = json.loads(tokz.backend_tokenizer.model.__getstate__())
for tok in long_toks: del tokz_state['vocab'][tok]
from tokenizers import models
model_class = getattr(models, tokz_state.pop('type'))
tokz.backend_tokenizer.model = model_class(**tokz_state)
However, this approach results in the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[67], line 3
1 from tokenizers import models
2 model_class = getattr(models, tokz_state.pop('type'))
----> 3 tokz.backend_tokenizer.model = model_class(**tokz_state)
TypeError: argument 'merges': failed to extract enum PyMerges ('Merges | Filename')
- variant Merges (Merges): TypeError: failed to extract field PyMerges::Merges.0, caused by TypeError: 'str' object cannot be converted to 'PyTuple'
- variant Filename (Filename): TypeError: failed to extract field PyMerges::Filename.0, caused by TypeError: 'list' object cannot be converted to 'PyString'
This approach follows this GitHub issue, where others have reported the same error.
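From the error message, my guess (continuing from the snippet above, so tokz_state and model_class are the same objects) is that the BPE constructor wants merges as (left, right) tuples rather than the 'left right' strings in the serialized state, so perhaps a conversion along these lines is needed, though I have not confirmed it:

# assumption: each serialized merge is a 'left right' string; convert it to a tuple
tokz_state['merges'] = [tuple(m.split(' ')) for m in tokz_state['merges']]
tokz.backend_tokenizer.model = model_class(**tokz_state)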
I would really appreciate any pointers on how to correctly remove tokens!