I’m trying to instantiate a tokenizer from a vocab file after it’s been read into Python. This is because I want to decouple reading objects from disk from model loading, so I want to load the files into Python in a different way and then use those Python objects to instantiate the Hugging Face objects. I can do this with the model itself like this:
```python
import io
import json

import torch
from transformers import DistilBertConfig, DistilBertModel

with open('pytorch_model.bin', 'rb') as f:
    buffer = io.BytesIO(f.read())

with open('config.json', 'r') as f:
    config = DistilBertConfig.from_dict(json.load(f))

torch_model = torch.load(buffer, map_location=torch.device('cpu'))
model_test = DistilBertModel.from_pretrained(
    pretrained_model_name_or_path=None,
    state_dict=torch_model,
    config=config,
)
```
But I can’t find a way to do it with the tokenizer. Does anyone have an idea how to do this?
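One workaround I can think of (a sketch, not a full in-memory solution): since `DistilBertTokenizer` expects a `vocab_file` path, the vocab text that has already been read into Python could be written to a temporary file and the tokenizer constructed from that path. The `vocab_text` string and the tiny vocab below are hypothetical placeholders, assuming a BERT-style one-token-per-line `vocab.txt`:

```python
import tempfile

from transformers import DistilBertTokenizer

# Hypothetical: the vocab.txt contents, already loaded into Python as a string.
vocab_text = "[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello\nworld\n"

# Round-trip through a temporary file, since the tokenizer wants a path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(vocab_text)
    vocab_path = f.name

tokenizer = DistilBertTokenizer(vocab_file=vocab_path)
print(tokenizer.tokenize("hello world"))
```

This still touches disk briefly, so it only partially decouples the two steps; it just moves the file handling under your control rather than the library’s.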