Is there a way to save a pre-compiled AutoTokenizer?

Sometimes, we’ll have to do something like this to extend a pre-trained tokenizer:

from transformers import AutoTokenizer
from datasets import load_dataset

# Start from the pre-trained tokenizer that we want to extend.
tokenizer = AutoTokenizer.from_pretrained(
    'moussaKam/frugalscore_tiny_bert-base_bert-score'
)

ds_de = load_dataset("mc4", 'de')
ds_fr = load_dataset("mc4", 'fr')

# Train new tokenizers of the same type on the German and French corpora.
de_tokenizer = tokenizer.train_new_from_iterator(
    ds_de['train']['text'], vocab_size=50_000
)

fr_tokenizer = tokenizer.train_new_from_iterator(
    ds_fr['train']['text'], vocab_size=50_000
)

# Collect the tokens that don't exist in the original vocabulary
# and add them to the pre-trained tokenizer.
new_tokens_de = set(de_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens_fr = set(fr_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens = new_tokens_de.union(new_tokens_fr)

tokenizer.add_tokens(list(new_tokens))

tokenizer.save_pretrained('frugalscore_tiny_bert-de-fr')
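
For reference, the number of added tokens can be checked directly; my guess below is that the slow load scales with this count:

print(len(new_tokens))  # tokens added on top of the original vocabulary
print(len(tokenizer))   # total size: original vocab plus added tokens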

And then, when loading the tokenizer:

tokenizer = AutoTokenizer.from_pretrained(
  'frugalscore_tiny_bert-de-fr', local_files_only=True
)

Loading takes quite a long time, as measured with %%time in a Jupyter cell:

CPU times: user 34min 20s
Wall time: 34min 22s

I guess this is due to regex compilation for the added tokens, which was also raised in Loading of Tokenizer is really slow when there are lots of additional tokens · Issue #914 · huggingface/tokenizers · GitHub.

That is mostly tolerable, since the tokenizer only has to load once per session and the rest of the work can proceed without redoing the regex compilation.

But is there a way to save the tokenizer in a pre-compiled, binary form, so that the whole regex compilation is skipped the next time it is loaded?
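
For example, I imagined something along these lines. This is just a sketch: I'm assuming the loaded tokenizer is a fast tokenizer backed by the Rust tokenizers library, and I'm not sure serializing the backend this way actually skips the expensive step:

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Dump the (already compiled) Rust backend to a single tokenizer.json.
tokenizer = AutoTokenizer.from_pretrained('frugalscore_tiny_bert-de-fr')
tokenizer.backend_tokenizer.save('tokenizer.json')

# Later, reload the serialized backend directly,
# without going through add_tokens again.
tokenizer = PreTrainedTokenizerFast(tokenizer_file='tokenizer.json')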

Also asked on nlp - Is there a way to save a pre-compiled AutoTokenizer? - Stack Overflow