Sometimes, we’ll have to do something like this to extend a pre-trained tokenizer:
from transformers import AutoTokenizer
from datasets import load_dataset

# Base tokenizer whose vocabulary we want to extend
tokenizer = AutoTokenizer.from_pretrained(
    'moussaKam/frugalscore_tiny_bert-base_bert-score'
)

ds_de = load_dataset("mc4", "de", split="train")
ds_fr = load_dataset("mc4", "fr", split="train")

# Train new tokenizers of the same type on each language
de_tokenizer = tokenizer.train_new_from_iterator(
    ds_de['text'], vocab_size=50_000
)
fr_tokenizer = tokenizer.train_new_from_iterator(
    ds_fr['text'], vocab_size=50_000
)

# Keep only the tokens the base tokenizer does not already have
new_tokens_de = set(de_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens_fr = set(fr_tokenizer.vocab).difference(tokenizer.vocab)
new_tokens = new_tokens_de.union(new_tokens_fr)

tokenizer.add_tokens(list(new_tokens))
tokenizer.save_pretrained('frugalscore_tiny_bert-de-fr')
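As a quick sanity check (a hypothetical example; the exact pieces depend on what the trained vocabularies picked up), the extended tokenizer should report a larger vocabulary and will typically split German or French text into fewer pieces than the original:

base = AutoTokenizer.from_pretrained(
    'moussaKam/frugalscore_tiny_bert-base_bert-score'
)
print(len(base), len(tokenizer))  # vocab size before vs. after add_tokens

sample = "Die Geschwindigkeitsbegrenzung wurde überschritten."
print(base.tokenize(sample))       # original subword split
print(tokenizer.tokenize(sample))  # usually fewer pieces with the added tokens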
And then when loading the tokenizer,
tokenizer = AutoTokenizer.from_pretrained(
    'frugalscore_tiny_bert-de-fr', local_files_only=True
)
It takes pretty long to load, as measured with %%time in a Jupyter cell:
CPU times: user 34min 20s
Wall time: 34min 22s
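Outside a notebook, the same load time can be measured with a plain timer, e.g. (a minimal sketch):

import time

from transformers import AutoTokenizer

start = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained(
    'frugalscore_tiny_bert-de-fr', local_files_only=True
)
print(f"tokenizer load took {time.perf_counter() - start:.1f}s")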
I guess this is due to regex compilation for the added tokens, which was also raised in Loading of Tokenizer is really slow when there are lots of additional tokens · Issue #914 · huggingface/tokenizers · GitHub.
I think it’s okay, since the tokenizer only has to be loaded once and the rest of the work can be done without redoing the regex compilation.
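One way to make sure that cost is paid only once per process is to cache the loaded tokenizer, for example with functools.lru_cache (a minimal sketch; get_tokenizer is just a hypothetical helper name):

from functools import lru_cache

from transformers import AutoTokenizer


@lru_cache(maxsize=None)
def get_tokenizer(path='frugalscore_tiny_bert-de-fr'):
    # The slow load (and whatever regex compilation it triggers) happens only
    # on the first call; subsequent calls return the same cached object.
    return AutoTokenizer.from_pretrained(path, local_files_only=True)


tokenizer = get_tokenizer()  # slow the first time, effectively free afterwards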