Tokenizer shrinking recipes

As I’ve been building tiny models for hf-internal-testing (the Hugging Face Internal Testing organization), I need to shrink/truncate the original tokenizers and vocabs in order to get truly tiny models, and it often took quite a long time to figure out how. So I reached out for help and got several great recipes, which I thought I’d share here in case others need something similar.

Anthony Moi’s version

@anthony’s tokenizer shrinker:

import json
from transformers import AutoTokenizer
from tokenizers import Tokenizer

vocab_keep_items = 5000
mname = "microsoft/deberta-base"

tokenizer = AutoTokenizer.from_pretrained(mname, use_fast=True)
assert tokenizer.is_fast, "This only works for fast tokenizers."
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
vocab = tokenizer_json["model"]["vocab"]
if tokenizer_json["model"]["type"] == "BPE":
    # keep the first `vocab_keep_items` entries and only the merges whose
    # two inputs and resulting token all survived the truncation
    new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items }
    merges = tokenizer_json["model"]["merges"]
    new_merges = []
    for merge in merges:
        a, b = merge.split()
        new_token = "".join((a, b))
        if a in new_vocab and b in new_vocab and new_token in new_vocab:
            new_merges.append(merge)
    tokenizer_json["model"]["merges"] = new_merges
elif tokenizer_json["model"]["type"] == "Unigram":
    # Unigram stores the vocab as a list of (token, score) pairs, so slicing works
    new_vocab = vocab[:vocab_keep_items]
elif tokenizer_json["model"]["type"] in ("WordPiece", "WordLevel"):
    new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items }
else:
    raise ValueError(f"don't know how to handle {tokenizer_json['model']['type']}")
tokenizer_json["model"]["vocab"] = new_vocab
tokenizer._tokenizer = Tokenizer.from_str(json.dumps(tokenizer_json))
tokenizer.save_pretrained(".")
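
To double-check the result, a quick smoke test like this should do (a minimal sketch, not part of the original recipe; it assumes the shrunk tokenizer was saved to the current dir as above, and the sample sentence is arbitrary):

from transformers import AutoTokenizer

# reload the shrunk tokenizer and make sure it still works
tokenizer_tiny = AutoTokenizer.from_pretrained(".")
print(len(tokenizer_tiny))  # roughly vocab_keep_items (special tokens may shift the count)
print(tokenizer_tiny.tokenize("Just a quick smoke test."))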

LysandreJik’s version

Using the recently added train_new_from_iterator, as suggested by @lysandre:

from transformers import AutoTokenizer

mname = "microsoft/deberta-base" # or any checkpoint that has a fast tokenizer.
vocab_keep_items = 5000

tokenizer = AutoTokenizer.from_pretrained(mname)
assert tokenizer.is_fast, "This only works for fast tokenizers."
tokenizer.save_pretrained("big-tokenizer")
# Should be a generator of lists of texts.
training_corpus = [
    ["This is the first sentence.", "This is the second one."],
    ["This sentence (contains #) over symbols and numbers 12 3.", "But not this one."],
]
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=vocab_keep_items)
new_tokenizer.save_pretrained("small-tokenizer")
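
The toy corpus above is just to show the expected shape of the input. In practice you would feed batches from whatever texts you have; here is a minimal sketch of such a batching generator (my_texts is a stand-in for any iterable of strings, not something defined above):

# a simple batching generator: yields lists of texts of size batch_size
def batch_iterator(texts, batch_size=1000):
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(my_texts), vocab_size=vocab_keep_items)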

but this one requires a training corpus, so I had an idea to cheat and train the new tokenizer on its own original vocab:

from transformers import AutoTokenizer

mname = "microsoft/deberta-base"
vocab_keep_items = 5000

tokenizer = AutoTokenizer.from_pretrained(mname)
assert tokenizer.is_fast, "This only works for fast tokenizers."
vocab = tokenizer.get_vocab()
training_corpus = [ vocab.keys() ] # Should be a generator of lists of texts.
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=vocab_keep_items)
new_tokenizer.save_pretrained("small-tokenizer")

which is almost perfect, except it now doesn’t have any information about the frequency of each word/char (that’s how most tokenizers compute their vocab). If you need this info, you can fix it by having each key appear len(vocab) - ID times, i.e.:

training_corpus = [ (k for i in range(len(vocab) - v)) for k, v in vocab.items() ]

which will make the script take much, much longer to complete.

But for the needs of a tiny model (testing) the frequency doesn’t matter at all.

hack the tokenizer file version

Some tokenizers can be just manually truncated at the file level, e.g. Electra:

# Shrink the orig vocab to keep things small (just enough to tokenize any word, so letters+symbols)
# ElectraTokenizerFast is fully defined by a tokenizer.json, which contains the vocab and the ids, so we just need to truncate it wisely
import subprocess
from transformers import ElectraTokenizerFast

mname = "google/electra-small-generator"
vocab_keep_items = 3000

tokenizer_fast = ElectraTokenizerFast.from_pretrained(mname)
tmp_dir = f"/tmp/{mname}"
tokenizer_fast.save_pretrained(tmp_dir)
# resize tokenizer.json (vocab.txt will be automatically resized on save_pretrained)
# perl -pi -e 's|(2999).*|$1}}}|' tokenizer.json # 0-indexed, so vocab_keep_items-1!
closing_pat = "}}}"
cmd = (f"perl -pi -e s|({vocab_keep_items-1}).*|$1{closing_pat}| {tmp_dir}/tokenizer.json").split()
result = subprocess.run(cmd, capture_output=True, text=True)
# reload with modified tokenizer
tokenizer_fast_tiny = ElectraTokenizerFast.from_pretrained(tmp_dir)
tokenizer_fast_tiny.save_pretrained(".")
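
If perl isn’t at hand, the same truncation can be done in Python by editing tokenizer.json directly (a sketch in the spirit of Anthony’s recipe above, reusing tmp_dir and vocab_keep_items from the snippet; Electra’s model is WordPiece, so the vocab is a plain token-to-id mapping):

import json

path = f"{tmp_dir}/tokenizer.json"
with open(path) as f:
    tok = json.load(f)
# drop all entries whose id is >= vocab_keep_items
tok["model"]["vocab"] = { t: i for t, i in tok["model"]["vocab"].items() if i < vocab_keep_items }
with open(path, "w") as f:
    json.dump(tok, f, ensure_ascii=False)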

spm vocab shrinking

First clone sentencepiece into a parent dir:

git clone https://github.com/google/sentencepiece

Now, on to the shrinking:

# workaround for fast tokenizer protobuf issue, and it's much faster too!
import os
import sys

os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

from transformers import XLMRobertaTokenizerFast

mname = "xlm-roberta-base"

# Shrink the orig vocab to keep things small
vocab_keep_items = 5000
tmp_dir = f"/tmp/{mname}"
vocab_orig_path = f"{tmp_dir}/sentencepiece.bpe.model" # this name can be different
vocab_short_path = f"{tmp_dir}/spiece-short.model"
# HACK: need the sentencepiece source to get sentencepiece_model_pb2, as it doesn't get installed
sys.path.append("../sentencepiece/python/src/sentencepiece")
import sentencepiece_model_pb2 as model
tokenizer_orig = XLMRobertaTokenizerFast.from_pretrained(mname)
tokenizer_orig.save_pretrained(tmp_dir)
with open(vocab_orig_path, 'rb') as f: data = f.read()
# adapted from https://blog.ceshine.net/post/trim-down-sentencepiece-vocabulary/
m = model.ModelProto()
m.ParseFromString(data)
print(f"Shrinking vocab from original {len(m.pieces)} dict items")
for i in range(len(m.pieces) - vocab_keep_items): _ = m.pieces.pop()
print(f"new dict {len(m.pieces)}")
with open(vocab_short_path, 'wb') as f: f.write(m.SerializeToString())
m = None

tokenizer_fast_tiny = XLMRobertaTokenizerFast(vocab_file=vocab_short_path)
tokenizer_fast_tiny.save_pretrained(".")
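
And a quick, optional sanity check that the tiny spm-based tokenizer still loads and splits text (the sample sentence is arbitrary):

# not part of the original recipe -- just a smoke test of the shrunk tokenizer
print(f"tiny vocab size: {tokenizer_fast_tiny.vocab_size}")
print(tokenizer_fast_tiny.tokenize("A quick smoke test sentence."))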

If you have other related recipes please don’t hesitate to add those in the comments below.

p.s. if you create custom models that are derivatives of original ones, please, if possible, upload the script that created the derivative along with the model files, so that in the future it’s easy to update, replicate, or adapt it to other models. E.g. make-tiny-deberta.py in hf-internal-testing/tiny-deberta is the script that created hf-internal-testing/tiny-deberta.


gpt2 seems to have its special token "<|endoftext|>" stashed at the very end of the vocab, so it gets dropped and the code breaks. So I hacked it back in with:

    if "gpt2" in mname:
        new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items-1 }
        new_vocab["<|endoftext|>"] = vocab_keep_items-1
    else:
        new_vocab = { token: i for token, i in vocab.items() if i < vocab_keep_items }

Hi, thank you for sharing such useful information!
With reference to Anthony’s tokenizer shrinker, do you know how I can instead add tokens to my BPE model?

like you’d do with any other tokenizer? tokenizer.add_tokens()

Hi, I’ve tried that method, but it seems to produce output that I did not intend; I’m not sure if it is a bug that needs to be fixed.

For context, I am originally trying to add Chinese tokens to the tokenizer but for illustration purposes, I will demonstrate the “bug” in English. Chinese words are not separated by spaces and hence in the example you will see me trying to add a token that is a subword.

The code to reproduce the bug would be:

from transformers import AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang = "eng_Latn", tgt_lang = "zho_Hans")
tokenizer.add_tokens(["abcd"])

sent = 'I like to walk abcdgym along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))

sent = 'I like to walk gymabcd along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))

Evidently, tokenizer.add_tokens() works well if there is always a space after the added token, but it doesn’t work as intended if there isn’t a space after the added token (the tokenizer then introduces an additional space on its own).

I read the docs and figured out it is probably because the added tokens are isolated before the tokenization algorithm is applied. Hence I wanted to add tokens directly to the BPE model.


Thank you for explaining the issue, @KhaiKit

I’d recommend filing an issue and asking for support for this type of token. This use case doesn’t sound like it will be an isolated one to me. But I could be wrong.

Cool, I have just filed it at Tokenizer adds an additional space after the added token · Issue #28218 · huggingface/transformers · GitHub

On a side note, the issue I pointed out can be circumvented if I add tokens directly to the SentencePiece or BPE model. I found a neat solution in a YouTube video that adds tokens to a SentencePiece model, but I am not really sure how to add tokens to a BPE model without retraining it. I wonder if you could share how?
The reason is that the NLLB fast tokenizer is based on BPE, while the NLLB Python-based tokenizer is based on SentencePiece.

As I didn’t write that version of the code I haven’t delved into the details, but briefly looking at it I don’t see any reason why you shouldn’t be able to do that - have you tried?

Also, why can’t you just pick some long word token and replace it with your new token? Assuming you supply this new tokenizer with the model, shouldn’t it just work? I haven’t tried it, so this is just an idea.
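
For what it’s worth, here is a rough, untested sketch of what that swap could look like at the tokenizer.json level, using the deberta-base BPE tokenizer from the recipes above; the victim and new_token names are made up, and the big caveat is that for BPE the merges still assemble the old string, so the renamed vocab entry may never actually be produced unless the merges are adjusted too:

import json
from tokenizers import Tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
vocab = tokenizer_json["model"]["vocab"]

victim = "Ġhypothetically"  # hypothetical: some long token you are willing to sacrifice
new_token = "abcd"          # the token you actually want
if victim in vocab:
    # give the new token the id of the sacrificed one
    vocab[new_token] = vocab.pop(victim)

tokenizer._tokenizer = Tokenizer.from_str(json.dumps(tokenizer_json))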

Older tokenizers had it much easier - they had some 100 empty slots designed specifically for new tokens, so it was trivial to extend those without needing to change the vocab size.