Removing tokens from the GPT tokenizer

ajesujoba · January 30, 2023, 1:53pm

How can I remove unwanted sub-tokens from the GPT vocabulary or tokenizer? I tried an existing approach that was used for a RoBERTa-style model, shown below (Removing tokens from the tokenizer · Issue #15032 · huggingface/transformers · GitHub). However, it fails when re-initializing the “model” component of the backend_tokenizer with the new vocabulary.

#1. Get your tokenizer and the list of tokens you want to remove

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# tokens to remove from the vocabulary
unwanted_words = ['ply', 'Ġmor', 'Ġprovide', 'IC', 'ung', 'Ġparty', 'Ġexist', 'Ġmag']


#2. Get the arguments that allow us to initialize the "model" component of the backend_tokenizer.
model_state = json.loads(tokenizer.backend_tokenizer.model.__getstate__())
print(len(model_state["vocab"]))


#3. Modify the initialization arguments, in particular the vocabulary to remove the tokens we don't want

# remove all unwanted tokens from the vocabulary
for word in unwanted_words:
    del model_state["vocab"][word]

print(len(model_state["vocab"]))


#4. Initialize the "model" component of the backend_tokenizer again with the new vocabulary

from tokenizers import models

model_class = getattr(models, model_state.pop("type"))

tokenizer.backend_tokenizer.model = model_class(**model_state)

print(len(tokenizer.vocab))

And below is the error:

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-21-fa908d23c419> in <module>
     30 model_class = getattr(models, model_state.pop("type"))
     31 
---> 32 tokenizer.backend_tokenizer.model = model_class(**model_state)
     33 
     34 print(len(tokenizer.vocab))

TypeError: argument 'merges': failed to extract enum PyMerges ('Merges | Filename')
- variant Merges (Merges): TypeError: failed to extract field PyMerges::Merges.0, caused by TypeError: 'str' object cannot be converted to 'PyTuple'
- variant Filename (Filename): TypeError: failed to extract field PyMerges::Filename.0, caused by TypeError: 'list' object cannot be converted to 'PyString'

What other methods can I use or refer to? The original script I adapted was used for RoBERTa, which uses SentencePiece, but GPT uses BPE.
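
A note on the error above, offered as a sketch rather than a confirmed fix: the traceback says `merges` must be a list of tuples, while the serialized state holds each merge as a string ("a b") or, in newer tokenizers releases, as a two-item list. A BPE model also appears to reject a merge whose parts or result are missing from the vocabulary, so merges that touch a removed token have to go too. The helper name `prune_bpe_state` and the toy state below are hypothetical; only the "vocab" and "merges" keys come from the code above.

```python
def prune_bpe_state(model_state, unwanted_tokens):
    """Return a copy of a serialized BPE state without the unwanted tokens.

    Hypothetical helper: it assumes `model_state` has a "vocab" dict and a
    "merges" list, as in the json.loads(...__getstate__()) result above.
    """
    unwanted = set(unwanted_tokens)

    # Keep the original ids; removed tokens simply leave gaps.
    vocab = {tok: idx for tok, idx in model_state["vocab"].items()
             if tok not in unwanted}

    merges = []
    for merge in model_state["merges"]:
        # A merge may be serialized as "a b" or as ["a", "b"].
        left, right = merge.split(" ", 1) if isinstance(merge, str) else merge
        # Keep the merge only if both parts and the merged token survive.
        if left in vocab and right in vocab and left + right in vocab:
            merges.append((left, right))

    # Copy every other entry unchanged; the input dict is not modified.
    return dict(model_state, vocab=vocab, merges=merges)


# Toy state standing in for the real GPT-2 one (made-up tokens and ids).
toy_state = {
    "type": "BPE",
    "vocab": {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4},
    "merges": ["a b", "ab c"],
}

pruned = prune_bpe_state(toy_state, ["ab"])
print(pruned["vocab"])
print(pruned["merges"])
```

With the real tokenizer, the pruned state would replace `model_state` before step #4, and `model_class(**model_state)` should then receive merges in the tuple form the error asks for. Note the side effect visible in the toy run: once "ab" is removed, "abc" stays in the vocabulary but no merge can produce it any more, so longer tokens built from a removed token become unreachable.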

ikkiren · July 29, 2024, 7:30pm

Check this repo. Probably it solves your problem

ikkiren · August 20, 2024, 10:18pm

A more detailed answer

You can check the TokenizerChanger library.

Use the TokenizerChanger class to create the changer:

changer = TokenizerChanger(tokenizer)

Then use the following:

changer.delete_tokens(list_of_unwanted_tokens, include_substrings)

Deletes the unwanted tokens from the tokenizer.

If include_substrings is True, all token occurrences will be deleted even in other tokens. Defaults to True.
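
To make the effect of include_substrings concrete, here is a small plain-Python illustration of one reading of the description above. It is not the library's implementation; the function `tokens_to_delete` and the toy vocabulary are hypothetical, and the library's actual matching rules may differ.

```python
def tokens_to_delete(vocab, unwanted_tokens, include_substrings=True):
    """Hypothetical helper: list the vocabulary entries a deletion would hit."""
    unwanted = set(unwanted_tokens)
    if include_substrings:
        # Select every entry that contains an unwanted token as a substring.
        return sorted(t for t in vocab if any(u in t for u in unwanted))
    # Otherwise select exact matches only.
    return sorted(t for t in vocab if t in unwanted)


# Made-up vocabulary for illustration.
vocab = {"ung": 0, "young": 1, "Ġmag": 2, "Ġmagic": 3, "ply": 4}

print(tokens_to_delete(vocab, ["ung", "Ġmag"], include_substrings=True))
print(tokens_to_delete(vocab, ["ung", "Ġmag"], include_substrings=False))
```

Under this reading, the default setting also removes tokens such as "young" and "Ġmagic", so passing include_substrings=False is the safer choice when only the listed tokens should go.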

You can check the other delete functions if you need them.

I am the author of this open-source library. The idea for it came from my own research, where I ran into the problem described above: the inability to remove tokens from the vocabulary. I’m not trying to promote myself with this answer; I just want people facing this problem to know that it has a solution.
