How do I remove tokens from a BPE Tokenizer's vocabulary?

Hi there!

I need to remove specific tokens from my tokenizer’s vocabulary, and I am not quite sure how to do so. Specifically, I am using Qwen2Tokenizer, a BPE tokenizer, and I would like to remove specific Chinese tokens from its vocabulary. I have tried various methods, shown below, but to no avail.
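For context, the snippets below assume something like the following setup; the checkpoint name and the way long_toks is built here are placeholders, not my actual code:

from transformers import AutoTokenizer

tokz = AutoTokenizer.from_pretrained('Qwen/Qwen2-7B')   # assumed checkpoint
vocab = tokz.get_vocab()                                 # token -> id mapping
long_toks = [...]                                        # the specific (Chinese) tokens I want to remove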

Deleting Tokens from Vocabulary

for tok in long_toks: vocab.pop(tok)
tokz.vocab = vocab

This results in unknown tokens being output. The vocabulary lengths also end up inconsistent, i.e. `len(vocab) != len(tokz.vocab)`.

Tweaking vocab.json and merges.txt

vocab.json

import json
with open('tokenizer/vocab.json', 'r') as f: vocab = json.load(f)
for tok in long_toks: vocab.pop(tok)
with open('tokenizer/vocab.json', 'w') as f: json.dump(vocab, f, indent=2)

merges.txt

from fastcore.all import L  # L is fastcore's list class, used throughout these snippets
with open('tokenizer/merges.txt', 'r') as f: merges = L(f.readlines()[1:])  # skip the header line

While I am able to update vocab.json, I do not know how to work with the contents of merges.txt. This file stores pairs as follows:

(#151387) ['Ġ Ġ\n','ĠĠ ĠĠ\n','i n\n','Ġ t\n','ĠĠĠĠ ĠĠĠĠ\n','e r\n','ĠĠ Ġ\n','o n\n','Ġ a\n','r e\n'...]

I am unable to determine which of these symbols correspond to Chinese characters. Is there any way I can decode or otherwise figure out what tokens such as ‘Ġ’ and ‘â½ Ĺ \n’ represent?
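For reference, one way to inspect these symbols is to go through the tokenizer itself instead of reading the byte-level characters directly; a rough sketch, assuming tokz is the loaded tokenizer:

import re

# convert_tokens_to_string reverses the byte-level (GPT-2 style) encoding,
# so 'Ġ' comes back as a leading space and multi-byte sequences come back
# as their original UTF-8 characters
print(repr(tokz.convert_tokens_to_string(['Ġhello'])))   # ' hello'

# flag vocabulary entries whose decoded form contains common CJK characters
cjk = re.compile(r'[\u4e00-\u9fff]')
chinese_toks = {tok: idx for tok, idx in tokz.get_vocab().items()
                if cjk.search(tokz.decode([idx]))}
print(len(chinese_toks))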

Training a Tokenizer

I feel training a tokenizer from scratch might not be feasible because of the amount of data it would require. In any case, I want the exact same Qwen2Tokenizer, just with certain tokens removed.

Backend

import json
tokz_state = json.loads(tokz.backend_tokenizer.model.__getstate__())
for tok in long_toks: del tokz_state['vocab'][tok]

from tokenizers import models
model_class = getattr(models, tokz_state.pop('type'))
tokz.backend_tokenizer.model = model_class(**tokz_state)

However, this approach results in the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[67], line 3
      1 from tokenizers import models
      2 model_class = getattr(models, tokz_state.pop('type'))
----> 3 tokz.backend_tokenizer.model = model_class(**tokz_state)
TypeError: argument 'merges': failed to extract enum PyMerges ('Merges | Filename')
- variant Merges (Merges): TypeError: failed to extract field PyMerges::Merges.0, caused by TypeError: 'str' object cannot be converted to 'PyTuple'
- variant Filename (Filename): TypeError: failed to extract field PyMerges::Filename.0, caused by TypeError: 'list' object cannot be converted to 'PyString'

This approach is based on this GitHub issue, where others ran into the same error.
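Judging from the error message, __getstate__ returns each merge as a single 'left right' string while the BPE constructor expects (left, right) tuples, so a possible workaround (just a sketch along the lines of the snippet above, untested) would be to convert them before rebuilding the model:

import json
from tokenizers import models

tokz_state = json.loads(tokz.backend_tokenizer.model.__getstate__())
for tok in long_toks: tokz_state['vocab'].pop(tok, None)

# drop merges whose resulting token was removed from the vocab, and turn the
# remaining "left right" strings into (left, right) tuples
tokz_state['merges'] = [tuple(m.split(' ')) for m in tokz_state['merges']
                        if m.replace(' ', '') in tokz_state['vocab']]

model_class = getattr(models, tokz_state.pop('type'))
tokz.backend_tokenizer.model = model_class(**tokz_state)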


I would really appreciate any pointers on how to correctly remove tokens!

I figured out a way to remove tokens from a BPE tokenizer.

I took each pair in the merges.txt file and concatenated it to produce the resulting token, looked that token up in the tokenizer’s vocabulary to obtain its ID, and decoded that ID to see what each merge actually represented. From that, I was able to filter the undesired merges out of the merges file.

from fastcore.all import L  # L is fastcore's list class
import json

with open('tokenizer/merges.txt', 'r') as f: mrules = L(f.read().split('\n')[1:-1])  # drop the header line and trailing blank
with open('tokenizer/vocab.json', 'r') as f: vocab = json.load(f)

# concatenate each merge pair into the token it produces and look up its ID
merged = L(merge.replace(' ', '') for merge in mrules)
merged_ids = L(vocab[merge] for merge in merged)
pairs = dict(zip(merged, merged_ids))

# keep only the merges whose resulting token is wanted; filtering into a new
# list avoids removing items from mrules while iterating over it, which
# silently skips entries (is_long is my helper that flags undesired tokens)
mrules = L(_merge for _merge in mrules
           if not is_long(tokz.decode(vocab[_merge.replace(' ', '')]))[0])

Then I simply removed the same undesired tokens from the tokenizer’s vocabulary.

with open('tokenizer/vocab.json', 'r') as f: vocab = json.load(f)
long_toks = {tok: idx for tok, idx in vocab.items() if is_long(tokz.decode(idx))[0]}
for tok in long_toks: vocab.pop(tok)
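The remaining step is to write the filtered files back to disk and reload the tokenizer; a rough sketch, assuming the folder layout is vocab.json/merges.txt based (a tokenizer.json, if present, takes precedence and would need the same edits):

import json

# keep the original '#version: ...' header from merges.txt
with open('tokenizer/merges.txt', 'r') as f: header = f.readline()

with open('tokenizer/vocab.json', 'w') as f: json.dump(vocab, f, ensure_ascii=False)
with open('tokenizer/merges.txt', 'w') as f: f.write(header + '\n'.join(mrules) + '\n')

# reload from the updated files (assumes the folder also contains the rest of
# the tokenizer config files)
from transformers import AutoTokenizer
tokz = AutoTokenizer.from_pretrained('tokenizer')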
