How to save a fast tokenizer using the Transformers library and then load it using Tokenizers?

maroxtn · August 29, 2021, 9:07am

I want to avoid importing the Transformers library during inference with my model, so I want to export the fast tokenizer and later import it using the Tokenizers library.

On the Transformers side, this is as easy as tokenizer.save_pretrained("tok"). However, when loading it from Tokenizers, I am not sure what to do.

from tokenizers import Tokenizer
Tokenizer.from_file("tok/tokenizer.json")

This seems to work, but it ignores the two other files in the directory, tokenizer_config.json and special_tokens_map.json, so I suspect it won't give me the same tokens.

Is there a way to import a tokenizer using all the files in the directory? Or better, can we import a pretrained fast tokenizer from the Hub?

Thanks

zirui3 · August 30, 2021, 6:55am

You can load the tokenizer from a directory with the from_pretrained method:

tokenizer = Tokenizer.from_pretrained("your_tok_directory")

maroxtn · August 31, 2021, 5:17pm

Thanks for your reply, but what I am trying to do is load it using the Tokenizers library rather than Transformers.

zirui3 · September 1, 2021, 3:12am

It seems that the Tokenizers library does not directly support loading from a directory.
An alternative is to write a wrapper function that first looks for tokenizer.json in the directory, then loads it via the from_file method of Tokenizer:

from os.path import isfile, join
from tokenizers import Tokenizer

def load_tokenizer_from_dir(your_dir):
    # Check that tokenizer.json exists in the given directory before loading it
    path = join(your_dir, "tokenizer.json")
    if isfile(path):
        return Tokenizer.from_file(path)
    else:
        raise ValueError("tokenizer.json does not exist in dir {}".format(your_dir))

The Transformers library implements its from_pretrained method in a similar way.

maroxtn · September 1, 2021, 8:07am

Thank you for the response again!
And would both tokenizers, the one from the Transformers library and the one from the Tokenizers library, give exactly the same output? (If yes, what's the point of tokenizer_config and special_tokens_map?)

sgugger · September 1, 2021, 12:27pm

The tokenizer_config contains information that is specific to the Transformers library (like which class to use to load this tokenizer when using AutoTokenizer). As for the other files, they are generated for compatibility with the slow tokenizers. Everything you need to load a tokenizer from the Tokenizers library is in the tokenizer.json.
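One way to convince yourself that the special tokens travel with tokenizer.json is to read the file as plain JSON. The sketch below is only an illustration: the helper name is hypothetical, and it assumes the file has a top-level "added_tokens" list whose entries carry "content" and "special" fields, which is the layout written by recent Tokenizers releases as far as I know.

```python
import json

def special_tokens_in_tokenizer_json(path):
    """Return the contents of added tokens flagged as special in a tokenizer.json file.

    Hypothetical helper: assumes a top-level "added_tokens" list whose
    entries are dicts with "content" and "special" keys.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Keep only the tokens marked as special; a missing key yields an empty list.
    return [tok["content"] for tok in data.get("added_tokens", []) if tok.get("special")]
```

Calling special_tokens_in_tokenizer_json("tok/tokenizer.json") on the directory saved earlier should list the same special tokens that appear in special_tokens_map.json, if the assumption about the file layout holds.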

maroxtn · September 1, 2021, 1:17pm

Crystal clear. Thank you all for the help and assistance

danielcabal · December 14, 2022, 11:57pm

Can you save a tokenizer from transformers into the tokenizer.json format? For example, GPT2Tokenizer.save_pretrained returns vocab.json, merges.txt, etc. Can I make it save a tokenizer.json so it can be loaded from the tokenizers library instead without transformers?
