How to save a fast tokenizer using the Transformers library and then load it using Tokenizers?

I want to avoid importing the Transformers library during inference with my model. For that reason, I want to export the fast tokenizer and later import it using the Tokenizers library.

On the Transformers side, this is as easy as tokenizer.save_pretrained("tok"); however, when loading it from Tokenizers, I am not sure what to do.

from tokenizers import Tokenizer
Tokenizer.from_file("tok/tokenizer.json")

This seems to work, but it ignores the two other files in the directory, tokenizer_config.json and special_tokens_map.json, so I suspect it won't give me the same tokens.

Is there a way to import a tokenizer using all the files in the directory? Or better, can we import a pretrained fast tokenizer from the Hub?

Thanks

You can load the tokenizer from a directory with the from_pretrained method:

tokenizer = Tokenizer.from_pretrained("your_tok_directory")

Thanks for your reply, but what I am trying to do is load it using the Tokenizers library rather than Transformers.

It seems that the Tokenizers library does not directly support loading from a directory.
An alternative is to write a wrapper function that first looks for tokenizer.json in the directory, then loads it via the from_file method of Tokenizer:

from os.path import isfile, join
from tokenizers import Tokenizer

def load_tokenizer_from_dir(your_dir):
    # Check that the tokenizer file exists before loading it
    tokenizer_path = join(your_dir, "tokenizer.json")
    if isfile(tokenizer_path):
        return Tokenizer.from_file(tokenizer_path)
    else:
        raise ValueError("tokenizer.json does not exist in dir {}".format(your_dir))

The Transformers library implements the from_pretrained method in a similar way.

Thank you for the response again!
And would both tokenizers, the one from the Transformers library and the one from the Tokenizers library, give exactly the same output? (If yes, what's the point of tokenizer_config and special_tokens_map then?)

The tokenizer_config contains information that is specific to the Transformers library (like which class to use to load this tokenizer when using AutoTokenizer). As for the other files, they are generated for compatibility with the slow tokenizers. Everything you need to load a tokenizer from the Tokenizers library is in the tokenizer.json.
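
To see for yourself what tokenizer.json carries, here is a minimal sketch that reads the file with the standard library only and lists its top-level sections, the model type, and the tokens flagged as special. The helper name summarize_tokenizer_json is made up for illustration, and the key names ("added_tokens", "special", "model", "type") are assumed from the layout written by recent versions of Tokenizers; older files may differ.

```python
import json

def summarize_tokenizer_json(path):
    """Summarize a tokenizer.json file (hypothetical helper, not a library API)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: "added_tokens" is a list of dicts, and entries with
    # "special": true are the special tokens (e.g. [CLS], [SEP], <|endoftext|>).
    special = [t["content"] for t in data.get("added_tokens", []) if t.get("special")]
    model = data.get("model") or {}
    return {
        "sections": sorted(data.keys()),
        "model_type": model.get("type"),  # may be None in older files
        "special_tokens": special,
    }

# Usage (assuming the directory from the question):
# print(summarize_tokenizer_json("tok/tokenizer.json"))
```

If the special tokens you expect show up in that list, the file loaded by Tokenizer.from_file already knows about them without special_tokens_map.json.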

Crystal clear. Thank you all for the help and assistance!

Can you save a tokenizer from Transformers in the tokenizer.json format? For example, GPT2Tokenizer.save_pretrained writes vocab.json, merges.txt, etc. Can I make it save a tokenizer.json so it can be loaded from the Tokenizers library instead, without Transformers?
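
One possible approach, offered as a sketch rather than a definitive answer: tokenizer.json is the serialization of a fast tokenizer, so you need to load the fast variant (for example through AutoTokenizer, or GPT2TokenizerFast instead of GPT2Tokenizer). As shown earlier in the thread, save_pretrained on a fast tokenizer already writes tokenizer.json. Fast tokenizers also expose the underlying Tokenizers object as backend_tokenizer, which has a save method. The helper name export_tokenizer_json below is hypothetical.

```python
import os

def export_tokenizer_json(fast_tokenizer, out_dir):
    """Write only tokenizer.json for a fast tokenizer (hypothetical helper)."""
    # Slow tokenizers such as GPT2Tokenizer have no backend_tokenizer attribute.
    backend = getattr(fast_tokenizer, "backend_tokenizer", None)
    if backend is None:
        raise TypeError("expected a fast tokenizer with a backend_tokenizer attribute")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "tokenizer.json")
    backend.save(path)  # tokenizers.Tokenizer.save writes the JSON file
    return path

# Usage at export time (requires transformers; the model name is only an example):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")  # returns the fast class when available
# export_tokenizer_json(tok, "tok")
#
# At inference time, only tokenizers is needed:
# from tokenizers import Tokenizer
# tokenizer = Tokenizer.from_file("tok/tokenizer.json")
```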